In this Issue

This issue continues our coverage of the HP Precision Architecture
hardware contributions. In the March 1987 issue we described the HP 3000
Series 930 and HP 9000 Model 840 Computers, which were HP's first
realizations of HP Precision Architecture in off-the-shelf TTL technology.
In this issue we present two more HP Precision Architecture computer
developments that leverage a common VLSI effort using HP's proprietary
NMOS-III integrated circuit technology. One of these system processing
units has been introduced as the HP 3000 Series 950 running HP's MPE-XL
commercial operating system and as the HP 9000 Model 850S running HP-UX,
HP's version of AT&T's industry-standard UNIX operating system
environment. These are the new top-of-the-line HP 3000 and HP 9000
computers. The other system processing unit, a midrange computer, has been
introduced as the HP 9000 Model 825 running HP-UX both in a single-user
workstation environment and as a multiuser HP-UX system.

The effort described in this issue dates back to 1982 and has been quite
large. The VLSI development involved 12 devices with up to 150 thousand
transistors per chip and was the largest multichip custom VLSI project HP
has undertaken to date. The two computers described share eight common
VLSI devices but have quite different design centers. The top-of-the-line
SPU maximizes performance, memory, and input/output capacity in a fairly
large data-center style enclosure, while the midrange SPU is designed for
maximum achievable performance in the smallest possible package. Yet both
are capable of running exactly the same operating environment and are
compatible members of the HP Precision Architecture computer family.

Because the VLSI development was so significant towards the realization of the two computers 
we have placed the papers describing it in the front of the issue followed by the SPU papers. 

-Peter Rosenbladt
Manager, High-performance Systems Operation 



Cover 

Processor boards from the HP 9000 Model 825 Computer (smaller board) and the HP 9000
Model 850S/HP 3000 Series 950 (larger board), shown with an unmounted pin-grid array package
housing an NMOS-III VLSI chip.



What's Ahead 

The October issue will complete the design story (begun in August) of the HP-18C and HP-28C 
Calculators, with articles on the calculators' thermal printer, infrared data link, and manufacturing 
techniques. Also featured will be the design of the HP 4948A In-Service Transmission Impairment 
Measuring Set. Another paper will relate Harvard Medical School's experience with computer-aided
training in their New Pathway curriculum, and we'll have a research report on formal methods for
software development. 



The HP Journal encourages technical discussion of the topics presented in recent articles and will publish letters expected to be of interest to our readers. Letters must be brief and are subject
to editing. Letters should be addressed to: Editor, Hewlett-Packard Journal, 3200 Hillview Avenue, Palo Alto, CA 94304, U.S.A.



SEPTEMBER 1987 HEWLETT-PACKARD JOURNAL 3
Copr. 1949-1998 Hewlett-Packard Co.



A VLSI Processor for HP Precision 
Architecture

by Steven T. Mangelsdorf, Darrell M. Burns, Paul K. French, Charles R. Headrick, and 
Darius F. Tanksalvala 



THIS PAPER DESCRIBES the VLSI chip set used in
the processors of three HP Precision Architecture
computers: the HP 3000 Series 950 and the HP 9000
Models 850S and 825. The Series 950 and Model 850S
processors are identical. All of the chips are designed in HP's
NMOS-III process.¹ NMOS-III is a high-performance NMOS
process with 1.7-µm drawn channel lengths (0.95-µm effective
channel lengths), 2.5-µm minimum contacted pitch,
and two levels of tungsten metallization. The chips have
been designed for a worst-case operating frequency of 30
MHz, although with the static RAMs available for caches
at present, the Model 850S/Series 950 processor operates
at 27.5 MHz and the Model 825 processor operates at 25
MHz. A 272-pin ceramic pin-grid array package was
developed to support the electrical and mechanical
requirements for the chip set (see "Pin-Grid Array VLSI
Packaging," page 10).

Overview 

Each processor consists of a CPU, two cache controller
chips (CCUs), a translation lookaside buffer controller chip
(TCU), a system bus interface chip (SIU), a floating-point
interface chip (MIU), and three floating-point math chips.
All chips except the SIU are common to both computer
systems. There are two versions of the SIU. The SIU in the
Model 850S/Series 950 interfaces to a high-bandwidth
64-bit system bus (SMB), and the SIU in the Model 825
interfaces to a medium-performance 32-bit system bus. The
floating-point math chips are the same NMOS-III chips that
are used in the HP 3000 Series 930 and HP 9000 Model 840
Computers.²

Fig. 1 shows the block diagram of the processor. The
diagram is applicable to both computers. The only
differences are in the SIU chips, the system buses, and the sizes
of the caches and TLBs (translation lookaside buffers). All
the chips communicate via the cache bus, which consists
of 32 address lines, 32 data lines, and 63 control lines. The
cache bus protocol is transaction-based; only the CPU or
the SIU can be masters.

The CPU has most of the hardware for fetching and
executing instructions. It is described in the paper on page
12.

Each CCU responds to cache bus transactions for doing
various cache operations and controls an array of
commercial static RAMs. The Model 850S/Series 950 has a two-way
set-associative 128K-byte cache, and the Model 825 has a
two-way set-associative 16K-byte cache. The CCUs also
perform instruction prefetching and correction of single-bit
errors in the cache RAM.

The TCU responds to virtual address cache bus transactions
which require a virtual-to-real address translation
and permission checking. It controls an array of commercial
static RAMs. It generates appropriate traps for illegal
accesses and TLB misses.

The SIU interfaces the processor to the system bus on 
which main memory is located, fetches cache blocks from 
memory during cache misses, and performs other system 
interface functions. 

The MIU responds to floating-point coprocessor transactions
on the cache bus and controls the floating-point math
chips. It also reports exceptions and trap conditions.

All chips on the cache bus also respond to various
diagnostic instructions for system dependent functions such
as self-test.

Cache and TLB 

Cache Function 

The cache speeds up CPU memory accesses by keeping
the most recently used portions of memory in a high-speed
RAM array. The array has a large number of rows called
sets and two columns called groups. Each array entry
contains a 32-byte block of memory and a tag identifying the
block's address.

Cache operation is shown in Fig. 2. When the CPU accesses
memory, the low-order address bits determine which of
the many sets the data may be in. The tag of each of the
two entries in this set is compared against the address. If
a hit occurs, the data can be quickly transferred to or from
the CPU. If a miss occurs, one of the two blocks in the set
is randomly selected to be removed to make room for the
new block. This block is written back to main memory if
its tag indicates it is dirty (i.e., it has been modified since
being moved into the cache). The new block is read from
main memory and placed into the vacated entry. The data
is then transferred to or from the CPU. Cache misses are
managed by the SIU and are not visible to software.

Fig. 1. HP Precision Architecture VLSI processor block diagram.
The VLSI chips are the central processing unit (CPU),
the cache control units (CCU), the TLB control unit (TCU),
the system bus interface unit (SIU), the floating-point math
interface unit (MIU), and three floating-point math chips (ADD,
MUL, DIV).

Fig. 2. Cache and TLB operation.
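
The lookup and replacement behavior described above can be sketched in a few lines of Python. This is a minimal illustrative model, not the CCU/SIU logic: the class and method names are hypothetical, main memory is modeled as a dictionary keyed by block number, and the data payload is carried but never actually modified on a store.

```python
import random

BLOCK_BYTES = 32   # block size given in the text
WAYS = 2           # two groups (columns) per set


class Entry:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None


class Cache:
    """Hypothetical model of a two-way set-associative, write-back cache."""

    def __init__(self, num_sets, memory):
        self.num_sets = num_sets
        self.memory = memory   # dict: block number -> block contents
        self.sets = [[Entry() for _ in range(WAYS)] for _ in range(num_sets)]
        self.writebacks = 0

    def _locate(self, address):
        block = address // BLOCK_BYTES
        # Low-order block-number bits select the set; the rest form the tag.
        return block % self.num_sets, block // self.num_sets

    def access(self, address, write=False):
        """Return True on a hit, False on a miss (after servicing it)."""
        index, tag = self._locate(address)
        for entry in self.sets[index]:
            if entry.valid and entry.tag == tag:
                entry.dirty = entry.dirty or write
                return True
        # Miss: pick one of the two entries at random, as the text describes.
        victim = random.choice(self.sets[index])
        if victim.valid and victim.dirty:
            old_block = victim.tag * self.num_sets + index
            self.memory[old_block] = victim.data   # write back the dirty block
            self.writebacks += 1
        block = tag * self.num_sets + index
        victim.valid, victim.dirty, victim.tag = True, write, tag
        victim.data = self.memory.get(block, bytes(BLOCK_BYTES))
        return False
```

Two accesses that fall in the same 32-byte block (for example, addresses 0 and 31) miss once and then hit, which is the locality effect the following paragraph relies on.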

Hits are much more common than misses because
programs tend to access a cached block many times before it
is removed. For example, the expected miss rate of the
Model 850S/Series 950 is 1.5%. The effective memory
access time is therefore very close to that of the cache itself.
This can be made quite short since the cache is small
enough to be implemented with high-speed RAMs and is
physically close to the CPU. The cache access time of the
Model 850S/Series 950, for example, is about 65 ns
measured at the pins of the CPU.

TLB Function 

The main function of the translation lookaside buffer
(TLB) is to translate virtual addresses to real addresses (Fig.
2). The TLB determines the real page number mapped to
by the space ID and the virtual page number. The real page
number concatenated with the invariant offset field gives
the 32-bit real address. The page size is 2K bytes. HP
Precision Architecture allows 64-bit virtual addresses, but
these computers implement only 48 bits to economize on
hardware. This is more than adequate for current software
needs.
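
As a rough illustration of the address arithmetic, the sketch below splits a virtual address into its fields and rebuilds a real address. The 2K-byte page size and the 48-bit and 32-bit widths come from the text; treating the 48 bits as a 16-bit space ID plus a 32-bit offset is an assumption made here for illustration, and the function names are hypothetical.

```python
PAGE_BYTES = 2 * 1024    # 2K-byte pages (from the text)
OFFSET_BITS = 11         # log2(2048): the invariant offset field
VIRTUAL_BITS = 48        # implemented virtual address width (from the text)
SPACE_ID_BITS = 16       # assumption: 48 bits = 16-bit space ID + 32-bit offset


def split_virtual(space_id, offset32):
    """Return (space ID, virtual page number, offset within the page)."""
    vpn = offset32 >> OFFSET_BITS
    page_offset = offset32 & (PAGE_BYTES - 1)
    return space_id, vpn, page_offset


def real_address(real_page_number, page_offset):
    """Concatenate the real page number with the unchanged page offset."""
    return (real_page_number << OFFSET_BITS) | page_offset


# Number of distinct virtual pages the TLB would have to cover.
VIRTUAL_PAGES = 2 ** (VIRTUAL_BITS - OFFSET_BITS)
```

With these widths there are 2^37 virtual pages, far more than any RAM array of the period could hold.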

The TLB also verifies that the current process has permis- 
sion to access the page. Two major types of checking are 
defined by HP Precision Architecture. Access rights check- 
ing verifies that the process has permission to access the 
page in the requested manner (read, write, or execute) and 
that its privilege level is sufficient. Protection ID checking
verifies that the page's 15-bit protection ID matches one of 
the four protection IDs of the process. A trap is issued if a 
violation is found. 
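
The two checks can be sketched as follows. This is a deliberately simplified model: the real access rights encoding is not given in this paper, so the `allowed` set and the single `required_level` field are assumptions, as is the convention that a numerically lower level is more privileged. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    allowed: frozenset      # subset of {"read", "write", "execute"}
    required_level: int     # assumption: numerically lower = more privileged
    protection_id: int      # 15-bit value, per the text


class ProtectionTrap(Exception):
    """Stands in for the trap issued when a violation is found."""


def check_access(page, kind, privilege_level, process_pids):
    """Apply both checks; process_pids holds the process's four protection IDs."""
    # Access rights check: requested manner and privilege level.
    if kind not in page.allowed or privilege_level > page.required_level:
        raise ProtectionTrap("access rights violation")
    # Protection ID check: the page's ID must match one of the four.
    if page.protection_id not in process_pids:
        raise ProtectionTrap("protection ID violation")
    return True
```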

The TLB cannot possibly store the real page number and
permission information for every virtual page since there
are 2^37 of them. Instead, it stores the information for the
most recently used pages and operates much like the cache.
The TLB has only a single group instead of two. Each entry
contains the real page number and permission information
plus a tag indicating the page's space ID and virtual page
number. Half of the entries are dedicated for instruction
fetches and half for data accesses.

When the CPU performs a virtual memory access, the 
address bits determine which of the many entries the page 
may be in. This entry's tag is compared against the address. 
If a hit occurs, the real page number is sent to the cache 
and the permission checks are performed. If a miss occurs,
the CPU is interrupted with a trap. The trap handler inserts
the required entry into the TLB, replacing the previous
entry. The trap handler returns to the instruction that
caused the miss; this time, a hit will occur. While misses
are serviced by software, they are not visible to the program- 
mer. 
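
The trap-and-retry behavior can be sketched as below. This is an illustrative model only: the page table is a plain dictionary, the entry index is simply the virtual page number modulo the half size (the hashing actually used is described later), and all names are hypothetical.

```python
class TLBMiss(Exception):
    """Stands in for the TLB miss trap."""


class TLB:
    """Hypothetical direct-mapped TLB with separate instruction and data halves."""

    def __init__(self, entries_per_half):
        self.n = entries_per_half
        # halves[0]: instruction entries, halves[1]: data entries
        self.halves = [[None] * entries_per_half for _ in range(2)]

    def lookup(self, space_id, vpn, is_data):
        slot = self.halves[int(is_data)][vpn % self.n]
        if slot is None or slot[0] != (space_id, vpn):
            raise TLBMiss((space_id, vpn, is_data))
        return slot[1]

    def insert(self, space_id, vpn, is_data, rpn):
        # Replaces whatever entry previously occupied this slot.
        self.halves[int(is_data)][vpn % self.n] = ((space_id, vpn), rpn)


def translate(tlb, page_table, space_id, vpn, is_data):
    """Retry the access after the (software) miss handler inserts the entry."""
    while True:
        try:
            return tlb.lookup(space_id, vpn, is_data)
        except TLBMiss:
            rpn = page_table[(space_id, vpn)]   # software miss handler
            tlb.insert(space_id, vpn, is_data, rpn)
```

The caller of `translate` never sees the miss, which mirrors the statement that misses are invisible to the programmer.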

Organization 

The cache and TLB system organization is shown in Fig.
1. Each of the two cache controller units (CCUs) controls
one of the two groups in the cache array. Similarly, the
TLB controller unit (TCU) controls the TLB array. The
cache and TLB arrays are implemented with commercial 
CMOS static RAMs. These provide excellent speed and 
density while eliminating the design effort and risk asso- 
ciated with an on-chip RAM array. 

Operation of the cache system is best understood by
considering a typical virtual-mode load cache bus transaction
as shown in Fig. 3. The transaction begins with the
transmission of the space register number from the CPU to the
TCU during clock 2 of state 1. The TCU dumps the space
ID out of the selected space register, and the virtual page
number arrives during clock 2 of state 2. The TCU forms
the TLB RAM address from the space ID and virtual page
number and drives it out to the RAM array during the same
clock 2. One bit of the address is set by whether the
transaction is for an instruction fetch or data access; this
effectively partitions the TLB array into two halves as discussed
above. At the end of clock 2 of state 3, the RAM data is
valid. The real page number field goes to the two CCUs for
their tag compare. The other fields go back to the TCU for
its tag compare and for permission checking. A miss or a
permission violation causes the TCU to issue a trap during
clock 1 of state 3.

The CCUs process the transaction in parallel. They
receive the address during clock 2 of state 2 and drive it out
to their RAM during the next clock 1. At the end of clock
2 of state 3, the RAM data is valid along with the real page
number from the TLB RAM. A tag compare is then
performed. If a hit occurs, the CCU drives the data out onto
the cache bus during clock 1 of state 3. If a miss occurs,
all the CCUs allow the bus to float high, which is interpreted
as a miss by the other chips.

It is important to note that the CCUs can address their 
RAM with the virtual address before the real page number 
has arrived from the TLB, greatly reducing overall access 
time. This is permitted because HP Precision Architecture
specifies a one-to-one mapping between virtual and real
addresses, and requires software to flush pages before their
mapping changes. This is a very important contribution of
HP Precision Architecture over other architectures, which
allow only the offset portion of the address to be used for
efficient cache addressing.

For real-mode transactions, the TCU does not read its 
RAM. Instead, the TCU receives the real page number from 
the cache bus and drives it to the CCUs through TTL buffers. 
This simplifies the CCU by making real and virtual accesses 
look alike. Note that the TCU does not connect to the real 
page number RAMs. The CCUs and the TCU cooperate 
during transactions such as TLB inserts where the real page 
number is sent over the cache bus. 

Stores are similar to loads, except that the CCU writes 
the new tag and data into the RAM during clock 1 of state 
3. Byte merge is performed by combining the old and new 
data words. If a miss occurs, the write is not inhibited, but 
the old tag and data are written instead of the new. When 
a trap occurs such as a TLB miss, the indication comes too 
late on the cache bus for this. Instead, states are added to 
the transaction to write the old tag and data back.
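
Byte merge, as mentioned above, can be illustrated with a short sketch. The `byte_enables` parameter and the choice of numbering byte 0 as the most significant byte are assumptions made for this example; the paper does not describe how the CCU encodes which bytes are being stored.

```python
def byte_merge(old_word, new_word, byte_enables):
    """Combine two 32-bit words byte by byte.

    byte_enables[i] is True when byte i (0 = most significant, an
    assumption of this sketch) should come from new_word.
    """
    merged = 0
    for i, enable in enumerate(byte_enables):
        shift = 8 * (3 - i)
        source = new_word if enable else old_word
        merged |= source & (0xFF << shift)
    return merged
```

For a one-byte store, three of the four enables are False and the remaining bytes of the old word are preserved.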

Instruction Fetch Timing

The load transaction described above is effectively two 
states long since its first state can overlap with the last 
state of the previous transaction, and it causes the CPU to 
interlock for one state. This interlock would also occur if
instruction fetch transactions required two states and
would cause an excessive performance loss. Therefore, a
prefetching algorithm is implemented by the CCU to short- 
en fetch transactions to one state and eliminate the inter- 
lock in almost all cases. 



There are three types of fetch transactions: sequential, 
branch, and random. They all have one thing in common: 
at the end of the transaction, the CCU prefetches the next 
sequential instruction and saves it in an internal register. 

A sequential fetch transaction fetches the instruction that
follows the previous one. The CCUs can simply dump out
the prefetched instruction from their registers. The
transaction is only one state long and does not cause an interlock.

A branch fetch transaction fetches the target of a taken 
branch instruction. The branch target address is sent from 
the CPU to the CCUs in a prefetch target transaction issued 
on the previous state. This gives the CCUs a one-state head 
start in fetching the branch target, so the fetch transaction 
is only one state long and does not cause an interlock. The 
CPU must send the prefetch target before it knows whether
the branch is taken. Untaken branches result in prefetch
targets followed by sequential fetches.

A random fetch transaction is issued for the first instruc- 
tion after a reset, the first instruction of any interruption 
handler, or the first or second instructions after a return 
from interruption instruction. The fetch transaction is two 
states long and causes a one-state interlock, but these cases 
are very rare. 

The TCU uses a similar prefetching algorithm. However, 
it is not required to prefetch the next sequential instruction 
on every fetch transaction. Since all instructions from the 
same page have the same translation and permission infor- 
mation, the real page number and permission check results 
are simply saved and used for subsequent sequential 
fetches. Whenever a sequential fetch crosses a page bound- 
ary, a one-state penalty is incurred while the TLB is reac- 
cessed to determine the real page number and check per- 
mission for the new page. It is also necessary to reaccess 
the TLB when an instruction is executed that might change 
the real page number or permission check results. As a 
result of this simplification, the TCU RAM is cycled at
most every other state (the CCU RAM is cycled every state).
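
To get a feel for how rarely the page-crossing penalty arises, the sketch below counts TLB reaccesses in a run of purely sequential fetches. It assumes 32-bit (4-byte) instructions and the 2K-byte page size, and it ignores the other causes of reaccess mentioned above; the function name is hypothetical.

```python
PAGE_BYTES = 2048        # page size from the text
INSTRUCTION_BYTES = 4    # assumption: fixed 32-bit instructions


def tlb_reaccesses(start_address, count):
    """Count page crossings in a run of `count` sequential fetches."""
    penalties = 0
    page = start_address // PAGE_BYTES
    for i in range(1, count):
        next_page = (start_address + i * INSTRUCTION_BYTES) // PAGE_BYTES
        if next_page != page:
            penalties += 1       # one-state penalty while the TLB is reaccessed
            page = next_page
    return penalties
```

Under these assumptions a page holds 512 instructions, so straight-line code pays the one-state penalty at most once per 512 fetches.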

Design Trade-Offs 

HP Precision Architecture does not require the cache or 
TLB to be flushed after each context switch like many other 
architectures. Therefore, it can never hurt performance to 
make the arrays larger, provided that the clock rate is not 
affected. Each CCU can control 8K, 16K, 32K, or 64K bytes
of data, and each TCU can support 2K or 4K total entries.
Supporting larger arrays would not have been difficult
except that pins for additional address lines were not
available. The Model 850S/Series 950 has 64K bytes of data per
CCU (128K bytes total) using 16K×4 RAMs, and a 4K-entry
TLB using 4K×4 RAMs. The Model 825 has 8K bytes per
CCU (16K bytes total) using 2K×8 RAMs, and a 2K-entry
TLB also using 2K×8 RAMs.
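
The array sizes quoted above imply the following geometry. The sketch assumes the data portion of each group's RAM array is 32 bits wide (matching the 32 data lines of the cache bus) and counts data RAMs only; tag and parity RAMs are not included, and the function name is hypothetical.

```python
def cache_geometry(bytes_per_group, groups, ram_depth, ram_width_bits,
                   block_bytes=32, word_bits=32):
    """Return (sets, total bytes, data RAM chips per group).

    Assumes a 32-bit-wide data array per group; tag and parity
    RAMs are deliberately left out of the chip count.
    """
    sets = bytes_per_group // block_bytes
    total_bytes = bytes_per_group * groups
    words_per_group = bytes_per_group * 8 // word_bits
    chips_wide = word_bits // ram_width_bits
    chips_deep = words_per_group // ram_depth
    return sets, total_bytes, chips_wide * chips_deep


if __name__ == "__main__":
    # Model 850S/Series 950: 64K bytes per CCU, 16K x 4 RAMs
    print(cache_geometry(64 * 1024, 2, 16 * 1024, 4))
    # Model 825: 8K bytes per CCU, 2K x 8 RAMs
    print(cache_geometry(8 * 1024, 2, 2 * 1024, 8))
```

Under these assumptions each RAM is exactly as deep as the number of words in a group, so the array is a single rank of chips in both machines.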

The number of groups (columns) in the cache array is
called the associativity. Increasing the associativity tends
to decrease the miss rate. This is because wider sets (rows)
decrease the chance that several heavily used blocks will
compete for a set that cannot hold them all. The
improvement is substantial going from an associativity of 1 to 2,
but further increases bring little improvement.
Unfortunately, increasing associativity generally tends to increase
cache access time because data from the groups must be
multiplexed together following the tag compare. In these
processors, access time is limited by parity decoding,
which is performed in parallel with the multiplexing.
Therefore, we decided to support from one to four groups,
each implemented by a CCU and its associated RAM. The
Model 850S/Series 950 and Model 825 both have a cache
associativity of 2. However, the TLB associativity is fixed
at 1 because the improvement in miss rate did not justify
the increase in access time and cost.

When the cache associativity is greater than 1, it becomes 
necessary to decide which of the entries in the set is to be 
replaced when a cache miss occurs. The performance ad- 
vantage of a least recently used strategy over a random 
strategy was too small to justify the extra implementation
cost. With the random strategy, each group is selected on
every Nth miss where N is the associativity. 
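
One way to read "each group is selected on every Nth miss" is a simple rotating counter, sketched below. Whether the hardware uses a single counter shared across all sets, as assumed here, is not stated in the text; the class name is hypothetical.

```python
class VictimSelector:
    """Rotating victim choice: each group is picked on every Nth miss.

    Assumption: one counter shared by all sets. Because misses to a
    given set are interleaved with misses to other sets, the choice
    looks effectively random from the point of view of any one set.
    """

    def __init__(self, associativity):
        self.associativity = associativity
        self.misses = 0

    def next_victim(self):
        group = self.misses % self.associativity
        self.misses += 1
        return group
```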

The cache block size is fixed at 32 bytes. Larger blocks 
tend to decrease miss rate up to a limit, but cache misses 
require more time to service. The block size was chosen 
to minimize the performance penalty for cache misses, 
which is the product of the miss rate and the miss service 
time. 

The cache uses a write-back store policy. This means
that system memory is not updated on a store until the 
block is eventually replaced from the cache or explicitly 
flushed by software. Because a block can be stored to many 
times before being replaced, a write-back policy minimizes 
system bus traffic, which is critical in multiprocessor sys- 
tems. 

The same cache is used for both instruction fetches and 
data accesses. Separate caches are permitted by HP Preci- 
sion Architecture and have tremendous advantages when 
the CPU can issue both an instruction fetch and a data
access concurrently. However, the CPU always interleaves
these operations on the cache bus, and so separate caches
offered no advantage for us. One benefit of a unified cache
is that it generally has a significantly lower miss rate than
a split cache of the same size, particularly when the
associativity is greater than 1.

For the TLB, however, half the array is allocated for 
instructions and half for data. This prevents thrashing, the 
situation where a heavily used instruction page and a heavily
used data page compete for the same entry. Also, handling
TLB misses in software requires that the TLB be able
to hold an arbitrary instruction and data page concurrently. 

Address hashing is used to improve TLB performance. 
Instead of using the low-order bits of the virtual page 
number to select a TLB entry, an XOR of virtual page number
and space ID bits is used. This ensures that low-numbered
pages from all the spaces do not map to the same entry.
This minimizes thrashing because these pages tend to be
frequently used. According to preliminary studies
performed by our System Performance Laboratory, this
improves instruction TLB performance by a factor of 3 and
data TLB performance by a factor of 7.
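
The effect of hashing on entry selection can be sketched as follows. Which space ID bits are XORed with which virtual page number bits is not specified here, so using the low-order bits of both is an assumption, as is placing the instruction/data select bit at the top of the index. The function name is hypothetical.

```python
def tlb_index(space_id, vpn, index_bits, is_data):
    """Pick a TLB entry from the hashed address.

    Assumptions: the low-order bits of the space ID and of the virtual
    page number are XORed, and one extra high-order bit selects the
    instruction half (0) or the data half (1) of the array.
    """
    mask = (1 << index_bits) - 1
    hashed = (vpn ^ space_id) & mask
    return (int(is_data) << index_bits) | hashed
```

For a 4K-entry TLB split into two 2K-entry halves, `index_bits` would be 11. Page 0 of spaces 1, 2, and 3 would all land on entry 0 without hashing; with the XOR they land on entries 1, 2, and 3.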

Error Correction 

Several features are included to increase product
reliability and availability. The CCU implements detection and
correction of all single-bit errors in the cache array. Each
data word and tag are protected by 16-bit horizontal parity.
The CCU also maintains the reference vertical (column)
parity in an internal register that is updated on all writes.
Copy-in transactions require special handling because they
are write-only rather than read-modify-write. The CCU
accumulates the vertical parity of the replaced block during
copy-out transactions (for dirty misses) or autonomously
(for clean misses) before the SIU issues the copy-in
transactions.

When a parity error is detected during a read, the CCU
hangs the cache bus and initiates error correction. It walks
through the entire array (up to 16K locations) to determine
the actual vertical parity, and compares this with the
reference vertical parity in its register. The bit position in which
a mismatch occurs indicates which bit of the data should
be flipped. Error correction occurs without software inter- 
vention. 
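
The combination of horizontal and vertical parity can be illustrated with a small model. This is a simplified sketch, not the CCU design: it keeps a single horizontal parity bit per word rather than the 16-bit scheme in the text, it treats every write as a read-modify-write, and the class and method names are hypothetical.

```python
def parity(word):
    """Even/odd parity of an integer: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


class ProtectedArray:
    """Hypothetical RAM array with horizontal and vertical parity."""

    def __init__(self, size):
        self.words = [0] * size
        self.hparity = [0] * size
        self.vertical = 0   # reference vertical (column) parity register

    def write(self, index, value):
        # Read-modify-write: remove the old word's contribution, add the new.
        self.vertical ^= self.words[index] ^ value
        self.words[index] = value
        self.hparity[index] = parity(value)

    def read(self, index):
        word = self.words[index]
        if parity(word) != self.hparity[index]:
            # Horizontal parity says this word is bad: walk the whole
            # array to find which column disagrees with the reference.
            actual = 0
            for w in self.words:
                actual ^= w
            word ^= actual ^ self.vertical   # flip the mismatching bit
            self.words[index] = word
            self.hparity[index] = parity(word)
        return word
```

Horizontal parity identifies the failing word and vertical parity identifies the failing bit position, so together they locate a single-bit error exactly.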

After each correction, the CCU issues a low-priority
machine check to the CPU so that software can log the 
error and test the location for hard errors. If a hard error 
is found, an individual entry or an entire group of the array 
can be mapped out so that processing can continue (with 
degraded performance) until the machine is repaired. 

The TCU provides the same protection against single-bit
errors but in a simpler way. As in the CCU, 16-bit horizontal
parity is used to detect errors. When an error is detected,
the TCU issues a high-priority machine check to the CPU.
The high-priority machine check handler purges the bad
entry from the TLB. Upon return from the high-priority
machine check handler, a TLB miss trap will occur, and
the entry will be inserted into the TLB by the normal TLB
miss handler. This simple scheme works because HP
Precision Architecture never requires TLB entries to be dirty.

The high-priority machine check handler can also log 
the error and test the entry for hard errors. If a hard error
is found, one half or three quarters of the TLB can be mapped 
out so that processing can continue despite degraded per- 
formance. This is implemented by simply freezing one or 
two of the RAM address bits. 
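
Freezing address bits can be pictured as follows. The paper does not say which address bits are frozen or to what value, so the choice of the high-order bits and the `frozen_value` parameter are assumptions for illustration; the function name is hypothetical.

```python
def effective_index(index, frozen_bits, frozen_value, index_bits=12):
    """Force the top `frozen_bits` RAM address bits to `frozen_value`.

    Assumption: the high-order address bits are the ones frozen.
    Freezing 1 bit leaves half the array in use; freezing 2 bits
    leaves one quarter (three quarters mapped out).
    """
    keep = index_bits - frozen_bits
    low = index & ((1 << keep) - 1)
    return (frozen_value << keep) | low
```

Every access is steered into the surviving region, so the TLB keeps working with fewer usable entries and a correspondingly higher miss rate.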

Internal Circuits 

Internal circuits on the TCU and CCU chips are designed
to run at 30 MHz worst-case. The chips provide 25-MHz
worst-case operation with 25-ns RAMs and 27.5-MHz
worst-case operation with 22-ns RAMs.

Writes into the cache array require only one state. A 
split-phase loop is used to generate timing edges for the
NCE and NWE RAM control signals and for the data driver
enable and tristate (see "A Precision Clocking System,"
page 17).

Not surprisingly, transfers of data and addresses to and
from the RAMs provide some of the tightest timing paths
on these chips. For instance, full 16-bit parity encode and
decode must be completed in 7 ns without the advantage
of clock edges. This is accomplished by using a tree of XOR
gates constructed from analog differential amplifiers of the
type shown in Fig. 4. This circuit requires two true/complement
pairs as inputs to produce one output pair. The small
signal swing this circuit requires allows the result to
propagate through the parity tree fast enough to meet its 7-ns
budget. The analog differential voltages at the output of
the tree are converted to digital voltages by a level
translator.

Fig. 4. Analog differential amplifier used as an XOR gate.
(D = depletion-type MOSFET.)
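
Logically, the parity tree reduces its inputs pairwise. The sketch below models only that logic, not the analog circuit, and it assumes a balanced binary tree of two-input XORs, which the paper does not state explicitly. The function name is hypothetical.

```python
def parity_tree(bits):
    """Reduce a power-of-two-length list of bits with 2-input XORs.

    Returns (parity, depth), where depth is the number of gate levels
    the signal passes through in a balanced tree.
    """
    level = list(bits)
    depth = 0
    while len(level) > 1:
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth
```

For 16 inputs a balanced tree has four gate levels, all of which must settle within the 7-ns budget quoted above.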

TLB address hashing has been implemented for
improved TLB performance. Since the address drive time is
in the TLB's critical timing path, the hashing circuits need
to be very fast. Fast hashing is accomplished by using the
fact that the space ID bits of the virtual address and the
control data are available early to set up transfer FETs. The
circuit simply steers the late-arriving virtual page number
bit to the positive or negative input of the address pad
drivers, depending on whether the early space ID bit was
a 0 or a 1. This logically XORs the space ID and virtual page
number bits with a minimal delay of only one transfer gate.

Three types of control structures are used to provide 
different combinations of speed and density. Dynamic 
PLAs (programmable logic arrays) with two levels of logic 
and a latency of two clock phases are used for the majority 
of chip control, and smaller two-level static PLAs with one
phase of latency are reserved for more speed-critical
control. PLA outputs can qualify each other and be wire-ORed
together, effectively providing third and fourth levels of
logic. It is also possible for static PLAs to generate the bit
lines for dynamic PLAs to give an extra two levels of logic.
Where speed is essential, a few random logic structures
are used.



System Interface Unit

The function of the system interface unit (SIU) is to
interface the processor cache bus to the SPU system bus so
that it can communicate with main memory and I/O. Since
the price/performance goals of the different computers
could only be met with two different system bus
definitions, two separate SIUs were designed. SIUF interfaces to
the 32-bit MidBus used in the Model 825. SIUC interfaces
to the 64-bit SMB used in the Model 850S/Series 950. The
two SIUs are very similar except that the SIUC implements
multiprocessor cache coherency algorithms.

Most of the functions of the SIU involve block transfers
to and from the CCUs over the cache bus. A typical cache
miss sequence is shown in Fig. 5. It begins with a load or
store cache bus transaction in which both CCUs signal a
miss. The CCUs send the real address of the requested
block to the SIU so that it can begin arbitrating for a read
transaction on the system bus. The CCU that is selected
for replacement then sends the real address of the block
to be replaced along with an indication of whether the
block has been modified. If so, the SIU issues eight copy-out
cache bus transactions to transfer the replaced block to its
write-back buffer.

Sometime after this is completed, the requested block 
will begin to arrive on the system bus. The SIU issues eight 
copy-in cache bus transactions to transfer it to the CCU. 
Finally, the SIU issues a restart cache bus transaction to
end the miss sequence, and the CPU resumes issuing trans- 
actions. For a load miss, the requested data is transferred 
from the CCU to the CPU during the restart transaction. 
The miss penalty is 27 instructions for the SIUF and 16.5 
instructions for the SIUC. 
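
The numbers in this section can be tied together with a little arithmetic. The sketch below combines the 1.5% expected miss rate quoted earlier for the top-of-the-line machine with the 16.5-instruction SIUC miss penalty; it ignores write-back traffic and any overlap, so it is a rough estimate only, and the function names are hypothetical.

```python
BLOCK_BYTES = 32       # cache block size (from the text)
BUS_DATA_BYTES = 4     # the cache bus has 32 data lines (from the text)


def transfers_per_block():
    """Cache bus transactions needed to move one block, one word each."""
    return BLOCK_BYTES // BUS_DATA_BYTES


def miss_cost_per_access(miss_rate, miss_penalty):
    """Average penalty per cache access: miss rate times miss service time."""
    return miss_rate * miss_penalty


if __name__ == "__main__":
    print(transfers_per_block())               # word transfers per 32-byte block
    print(miss_cost_per_access(0.015, 16.5))   # instruction times lost per access
```

The first figure matches the eight copy-in or copy-out transactions per block described above; the second is roughly a quarter of an instruction time per cache access under the stated assumptions.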

The SIU will arbitrate for a write transaction on the
system bus to empty its write-back buffer as long as no
higher-priority operation is pending. The SIUC has a two-entry
write-back buffer for maximum performance. 

HP Precision Architecture defines an uncached region
of the real address space called I/O space. When a load or
store transaction accesses I/O space, the CCUs always signal
a miss. The SIU issues a single-word read or write
transaction on the system bus since no block transfers are involved.
In systems using SIUF, the processor dependent code ROM
(which contains boot code) is also uncached. The SIUF
accesses this ROM in a byte-serial manner to reduce cost.

The SIU contains some of the HP Precision Architecture
control registers. This results in better system partitioning
and reduces CPU chip area. On the SIU chip are the
temporary registers, the interval timer, and the external
interrupt request and mask registers.

The SIU detects a power-up or hard reset operation on
the system bus and generates cache bus signals to initialize
the other chips. It interrupts the CPU when a powerfail is
detected or an external interrupt is pending. Checking is
performed for system bus parity errors and protocol
violations, and errors are reported to the CPU through the
high-priority machine check.

HP Precision Architecture requires the SIU to have a few 
registers accessible to other devices on the system bus for 
address relocation and receiving interrupts and resets.
Additional registers are used for logging system bus errors
and other implementation dependent functions. 

Many other architectures require the SIU to query the
cache during DMA transfers to ensure consistency with
main memory. With HP Precision Architecture, it is
software's responsibility to flush the affected parts of the cache
before starting a DMA transfer. This results in a
considerable savings of hardware and complexity.

The processor and memory subsystems of the Model 
850S/Series 950 are designed to implement hardware al- 
gorithms to ensure full cache and TLB coherency between 
multiple symmetric processors as required by HP Precision 
Architecture. Whenever a memory access occurs, all pro- 
cessors must ensure that the processor doing the access 
acquires an accurate copy of the data. 

Each 32-byte block in the cache is marked either clean 
and private, clean and public, or dirty. If a block is dirty
or clean and private, this is the only copy of the block and 
can be modified without communicating with other proces- 
sors. If the block is clean and public, it can be read but not 
written. If the block is absent, it is requested from the 
memory subsystem. 

During the return operation on the SMB, all other proces- 
sors check their caches for copies of the requested block. 
If a checking processor discovers a dirty block, the return 
operation is aborted, the dirty block is flushed to memory, 
and the return operation is repeated with the correct copy 
of the data. If a checking processor discovers a clean and 
public or clean and private copy, it will either delete it or
change it to public depending on whether the requesting
processor wants a private copy. Similar algorithms are used
to maintain coherency during the load, store, load and 
clear, purge cache, and flush cache operations. 
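
The checking processor's side of the return operation can be sketched as a small state function. This models only what the paragraph above states; the state a dirty block is left in after being flushed is not given, so treating it as absent is an assumption, and all names are hypothetical.

```python
from enum import Enum


class State(Enum):
    ABSENT = "absent"
    CLEAN_PRIVATE = "clean and private"
    CLEAN_PUBLIC = "clean and public"
    DIRTY = "dirty"


def snoop(state, requester_wants_private):
    """Return (new_state, abort_and_flush) for a checking processor.

    abort_and_flush is True when the return operation must be aborted,
    the dirty block flushed to memory, and the return repeated.
    """
    if state is State.ABSENT:
        return State.ABSENT, False
    if state is State.DIRTY:
        # Assumption: the flushed block is no longer held afterwards.
        return State.ABSENT, True
    # Clean copy (public or private): delete it or demote it to public.
    if requester_wants_private:
        return State.ABSENT, False
    return State.CLEAN_PUBLIC, False
```

A requesting processor would then hold the block as clean and private only if every other processor has dropped its copy, which is what allows private and dirty blocks to be modified without further bus traffic.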

To maintain TLB coherency in a multiprocessor system,
the purge TLB instruction is broadcast on the SMB to all
processors so that all TLBs are purged simultaneously. The
SIU maintains coherency if a purge TLB occurs on the SMB
at the same time that the TLB entry is being used to access 
data from the memory subsystem during a cache miss. 

Floating-Point Coprocessor 

HP Precision Architecture allows coprocessors to pro- 
vide hardware assist for complicated operations. Coproces- 
sors have their own set of registers, and once a coprocessor 
operation is started, the coprocessor can process that 
instruction concurrently with the CPU. Floating-point 
operations are well-suited for coprocessing. In general, 
floating-point operands never need to be operated on by the 
CPU's integer ALU, so dedicated floating-point registers keep 
the general registers free. Floating-point operations also tend 
to take many cycles to complete. While the coprocessor is 
working on completing a previous operation, the CPU can 
continue. 

The architecture requires the coprocessor to implement 
full IEEE-compatible floating-point functionality. The 
coprocessor has sixteen 64-bit registers, of which twelve are 
floating-point registers and four are status and exception 
condition registers. Single, double, and quad precision 
operations are defined by the instruction set. The 
floating-point hardware actually implements a subset of the 
standard. The coprocessor redirects to software emulation 
routines any operations that cannot be handled in hardware. 
Trapping to software occurs only on infrequent operations 
and exceptional conditions and thus has a minimal 
impact on performance. 

The floating-point coprocessor is implemented by four 
chips: the math interface unit (MIU) and three proprietary 
HP floating-point chips. The three floating-point chips are 
an adder, a divider, and a multiplier. These same 
floating-point units are used in the HP 9000 Model 550, the HP 
9000 Model 840, and the HP 3000 Series 930 Computers. 

Math Interface Unit 

Review of the math chips' capabilities showed the benefit 
of a simple MIU design. The math chips can be configured 
to allow pipelining on the chips: for example, the add chip 
can be configured to perform up to five simultaneous 
independent adds. This increases throughput in those special 
cases in which many independent adds are needed. However, 
the penalty for taking this approach is that the overall 
latency is significantly increased and scalar use of the 
coprocessor suffers. The decision was made to allow only 
one floating-point operation at a time to be executed 
by the math chips. This has the double benefit of decreasing 
latency and keeping the MIU simple. The result is a very 
clearly defined and well-partitioned chip that needs a 
minimum of special circuitry. 

The MIU interacts with the CPU over the cache bus. The 
CPU processes floating-point instructions in its pipeline 
like other instructions. The CPU determines cache addresses, 
branch and nullify conditions, etc., before issuing the 
transaction to the MIU. When the transaction is issued, the 
cache and TLB report trap conditions and the MIU begins 
processing the instruction. The cache bus protocol is very 
flexible, allowing data and trap information to be sent 
during any state of a transaction. Transactions can also be 
extended to any length. This interface to the CPU simplifies 
the communication between the CPU and the MIU, since 
it removes the requirement that the MIU have knowledge 
of the CPU pipeline. 

The MIU allows operations to overlap with independent 
loads and stores. Because of this feature, the MIU includes 
on-chip interlock hardware that checks for interlock 
conditions and responds properly. Interlock conditions that must 
be recognized include attempting to start an operation 
before the previous one has completed, attempting to read a 
register that is the destination of an executing operation, 
attempting to load either the sources or destination of an 
executing operation, and attempting to read the status 
register while an operation is executing. In each of these cases, 
Pin-Grid Array VLSI Packaging 



A significant part of the design of the VLSI processor for the 
HP 3000 Series 950 and HP 9000 Models 850S and 825 Computers 
was the method of packaging the VLSI chips. A single pin-grid 
array (PGA) package was designed with enough flexibility and 
electrical and thermal performance that all six of the ICs on the 
processor board are able to use it. This same package is also 
used by three other custom VLSI circuits in the systems. 

The package design was constrained by the union of the 
requirements of all the chips that were to use the package: 

■ 272 pins, some dedicated to a common ground and others 
dedicated to three main power supplies. Additional flexibility 
for separating noisy supplies from internal logic supplies was 
also required. 

■ A different chip padout for each of the nine ICs that were to 
use the package. 

■ Ceramic package technology for high assembly yield and high 
reliability. 

■ Adherence to geometries that can be manufactured using 
thick-film ceramic technology and assembled using 
conventional wire-bonding technology. 

■ Support of high-power (12W dissipation) VLSI NMOS circuits. 

■ Consistency with through-hole mounting technology on 
conventional printed circuit boards. 



The 272-pin PGA package is shown in Fig. 1. The design 
makes full use of state-of-the-art multilayer ceramic technology. 

The PGA has six metallization layers, three of which are ground 
planes. A fourth layer is a mixed ground and power plane (this 
plane also acts much like an ac ground). Two layers contain all 
the signal lines. Each of these two layers has a ground plane 
above and below it, ensuring a transmission-line environment. 
Connections between layers are provided by vias punched in 
the ceramic. These vias also ensure ground plane and power 
supply integrity. 

A few changes in the conventional multilayer ceramic process 
were required. First, a heat slug made of 90% copper and 10% 
tungsten is provided for attachment of the IC. This results in a 
thermal resistance of 9°C/W from junction to heat sink, 
compared with approximately 3 to 5°C/W for conventional die 
attachment for the same IC and package size. Second, considerable 
flexibility is provided for pad assignments between ground, 
power, and signals. Over 90% of the pads have more than one 
bonding option, and 10% of these have three bonding options. 
Third, there are three tiers of wire-bond pads on the ceramic 
and two staggered rows of bond pads on the IC. This provides 
not only shorter, more dense bonding of the signals and optional 
power supplies, but also a way to achieve very short wire-bond 



the MIU will hang the processor until the interlock is 
resolved, and then continue. Allowing noninterlocked loads 
and stores gives the compilers the ability to improve 
performance significantly with optimized code. 
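
The four interlock conditions listed above can be expressed as a small predicate. The sketch below is only a software illustration of the rules in the text; the Operation record, the request names, and the must_interlock function are hypothetical and do not describe the actual MIU circuitry.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Operation:
    """A floating-point operation in flight (hypothetical record)."""
    sources: Tuple[int, ...]
    dest: int


def must_interlock(executing: Optional[Operation], request: str,
                   reg: Optional[int] = None) -> bool:
    """Return True if the MIU has to hang the processor for this request."""
    if executing is None:
        return False  # nothing in flight, nothing to wait for
    if request == "start_op":
        return True   # previous operation has not completed
    if request == "read_status":
        return True   # status is not final while an operation executes
    if request == "read":
        return reg == executing.dest
    if request == "load":
        return reg == executing.dest or reg in executing.sources
    raise ValueError(f"unknown request: {request}")
```

A load into a register that the executing operation does not touch returns False, which is the noninterlocked case that lets compilers overlap loads and stores with floating-point operations.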

The MIU interacts with the math chips across the math 
interface bus. The MIU is a complete master of this bus, 
controlling all the load, unload, and operation timings. 

Counters control the number of cycles needed before 
unloading the result from a math chip. These counters are 
available to software through an implementation depen- 
dent instruction. Since the floating-point math chips' cal- 
culation times are independent of any clocks, the number 
of cycles needed for an operation to complete varies with 
frequency. Giving software the capability to control the 
cycle counts allows the coprocessor to run as efficiently 
as possible at any frequency by changing the number of 
count cycles as frequency changes. As a debugging feature, 
another set of machine dependent commands allows soft- 
ware to take direct control of the math bus to facilitate 
system debugging involving math chips. 
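
Because a math chip's calculation time is fixed in nanoseconds, the count that software loads must grow with clock frequency. A minimal sketch of that relationship follows, assuming the count is simply the calculation time rounded up to a whole number of clock cycles. The function name unload_count and the 200-ns latency used in the usage note are hypothetical; the article does not give the math chips' calculation times.

```python
def unload_count(latency_ns: int, clock_mhz: int) -> int:
    """Smallest whole number of clock cycles that covers latency_ns.

    One cycle lasts 1000 / clock_mhz nanoseconds, so the count is
    ceil(latency_ns * clock_mhz / 1000), computed here with integers
    to avoid floating-point rounding.
    """
    if latency_ns < 0 or clock_mhz <= 0:
        raise ValueError("latency must be >= 0 and frequency must be > 0")
    return -(-(latency_ns * clock_mhz) // 1000)
```

With a made-up 200-ns calculation time, unload_count(200, 30) gives 6 cycles while unload_count(200, 25) gives 5, which is why a fixed count would either waste cycles at low frequency or unload too early at high frequency.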

Cache Bus Electrical Design 

The six chips in the processor communicate via a 
collection of 127 signals known as the cache bus. Each chip 
connects only to those signals necessary for its function. 

The bus operates in a precharge/pulldown manner. Each 
signal is designated as either clock 1 or clock 2. Clock 1 
signals transmit data during clock 1 and precharge during 
clock 2, and clock 2 signals transmit data during clock 2 
and precharge during clock 1. Each cache bus signal 
transmits one bit of information per state. 

During the precharge phase, all chips connected to a 
signal help drive it to a high value (2.85V nominal). During 
the transmit phase, if a chip wishes to assert a zero, it turns 
its driver on and pulls the signal low. If it wishes to assert 
a one, it allows the signal to remain high. Any chip 
connected to a signal may assert a zero independently of the 
other chips. This wired-AND logic simplifies the functional 
protocol. 
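
The wired-AND behavior of a precharge/pulldown signal can be modeled in a few lines. This is a logical sketch only (no timing or voltages), and the function name bus_value is hypothetical.

```python
def bus_value(asserted_bits):
    """Logic level seen on one cache bus signal after the transmit phase.

    asserted_bits holds the bit (0 or 1) that each connected chip wants
    to assert. The line starts precharged high; any chip asserting a
    zero turns on its pulldown and discharges it.
    """
    level = 1  # result of the precharge phase
    for bit in asserted_bits:
        if bit == 0:
            level = 0  # one pulldown is enough to pull the line low
    return level
```

The result is the logical AND of all the asserted bits: bus_value([1, 0, 1]) is 0, and a signal that no chip pulls down stays at 1.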

Because each cache access requires a transfer from the 
CPU to the CCUs and back, short cache bus transmission 
delays are essential to maximize the processor clock fre- 
quency. This was achieved through careful design of the 
drivers, printed circuit board traces, receivers, and clocking 
system. 

The drivers consist of a large pulldown transistor and a 
source bootstrapped predriver stage which can apply a full 
5V signal to the gate of the pulldown transistor. The delay 
through the driver is only 2.6 ns worst case. The precharge/ 
pulldown bus results in area-efficient drivers, since the 
pullup consists only of a precharger. This can be small 
because several chips plus the terminating resistor help 
pull each signal high and precharge has an entire phase in 
which to complete. 

The chips are positioned along a linear bus on the board. 
There are never more than two traces connected to a pin, 
so the driver never looks into an impedance lower than 
two parallel traces of 50Ω each when driving the initial 
edge. The linear bus also reduces undesirable reflections 
by minimizing impedance mismatches. Considerable 
attention was given to minimizing the length of traces, 
particularly those known to be critical. This resulted in a 
worst-case pin-to-pin propagation delay of only 5.7 ns. 

The receivers are of the zero-catching variety. They con- 
sist of bistable latches that are reset during the precharge 
lengths (less than 0.045 inch) from the pad on the IC directly to 
a ground plane on the PGA. A constraint on the IC designers 
was that power supply and ground pads could be located only 
at certain prespecified locations on the chip. 

The electrical requirements for the package were quite 
aggressive because of the large number of I/O connections and their 
high switching speeds. Minimizing power supply and ground 
noise was extremely important. It was also essential to provide 
a transmission-line environment for the signal lines. The 
metallization that serves as signal routing is in a stripline configuration. 
This configuration minimizes signal-to-signal coupling 
(crosstalk). Power supply routing is designed to minimize the path 
inductance and maximize the capacitance, thereby providing 
clean power supplies to the IC. The PGA can have up to thirteen 
independent power supplies which are isolated on the package 
and independently bypassed to ground on the package. The 
ground planes are shorted together through vias to provide a 
clean ground connection to the IC. 



Electrical models for the 272-pin PGA package were extracted 
directly from the artwork. In lieu of the challenge of verification 
by direct measurement, these models were verified by comparing 
measurements from a CPU in a test fixture with SPICE simulations 
that modeled the CPU, PGA, and printed circuit board 
environment. Good correlation between the measurements and 
simulation results was obtained. These models were then used by the 
designers for worst-case system modeling. 
phase and set during the transmit phase if their input falls 
below 1.3V for 2.9 ns or longer. Analog design techniques 
were used to raise and tightly control the trip level. Once 
tripped, the receivers do not respond to noise or reflections 
on the bus. This increases noise margin for receiving a zero 
and effectively reduces propagation delay. The delay 
through the receiver is only 2.9 ns worst case. 
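
The zero-catching behavior can be illustrated with a simple sampled-waveform model. This is a behavioral sketch under stated assumptions, not a circuit description: the input is assumed to be sampled at a fixed time step, each sample below the trip level is counted as a full step spent below it, and the function name receive is hypothetical. The 1.3V trip level and 2.9-ns duration are the figures quoted above.

```python
TRIP_MILLIVOLTS = 1300  # trip level from the text (1.3V)
HOLD_PS = 2900          # input must stay low this long (2.9 ns)


def receive(samples_mv, step_ps):
    """Bit latched by a zero-catching receiver during one transmit phase.

    samples_mv is the input voltage in millivolts at fixed intervals of
    step_ps picoseconds, starting from the precharged (reset) state.
    """
    below_ps = 0
    for mv in samples_mv:
        if mv < TRIP_MILLIVOLTS:
            below_ps += step_ps
            if below_ps >= HOLD_PS:
                return 0  # latch tripped; later noise or reflections are ignored
        else:
            below_ps = 0  # a brief dip that recovers does not trip the latch
    return 1  # line stayed high: a one is received
```

In this model a 1-ns glitch below the trip level is rejected, while a 3-ns low pulse followed by a reflection back above the trip level is still received as a zero.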
HP Precision Architecture NMOS-III 
Single-Chip CPU 

by Jeffry D. Yetter, Jonathan P. Lotz, William S. Jaffe, Mark A. Forsyth, and Eric R. DeLano 



LIKE ALL OF THE CUSTOM VLSI chips designed for 
the HP 3000 Series 950 and HP 9000 Models 850S and 
825 SPUs, the CPU is designed using HP's NMOS-III 
fabrication process. NMOS-III was a natural choice not 
only because it affords the density and speed required for 
the design, but also because of its proven manufacturability. 
The CPU chip contains 115,000 transistors packed onto 
a square die measuring 8.4 mm on a side. 

Using the NMOS-III technology and a variety of 
high-performance circuit design techniques, the CPU design 
achieves a 30-MHz worst-case operating frequency. This 
exceeds the system clock frequency required by the Model 
850S/Series 950 and Model 825 SPUs, allowing the CPU 
chip to have high yield and operating margins in those 
environments. 

The chip implements the entire set of 140 instructions 
and 25 trap and interruption conditions of HP Precision 
Architecture. Instruction execution is pipelined, and 130 
of the 140 instructions execute in a single machine cycle. 
Instruction sequencing and CPU control are implemented 
in hardwired logic, rather than in microcode. Hardwired 
control ensures that instructions are executed in the fewest 
possible cycles and at the highest possible frequency. 

The key design goals were to produce a manufacturable 
HP Precision Architecture CPU design optimized for 
sustained performance in a realistic high-performance 
computing environment. Circuit design efforts were 
concentrated on the critical timing paths. Within the CPU chip, 
the execution unit presented some of the most critical speed 
design challenges. The execution unit produces 32-bit 
results for the arithmetic, logical, extract, and deposit 
instructions (see "Execution Unit," next page). Many critical paths 
extend off the chip through its cache bus interface. These 
paths are optimized to achieve the highest overall system 
performance. 

Fundamental to the performance of the CPU chip is a 
pair of high-drive clock buffers which produce a pair of 
nonoverlapping clock pulses (see "A Precision Clocking 
System," page 17). Each clock signal supports a fanout of 
over 8,000 gates with a single level of buffering, a feat that 
is only possible within the confines of an integrated design. 
Each component of the CPU derives its timing information 
from these clock signals, yielding a tightly coupled 
synchronous CPU design. 

Architecture Overview 

HP Precision Architecture specifies a register-based CPU 
architecture in which the registers are the source of all 
operands and the repository for all results. There are 32 
general registers, which are used for local storage of 
operands, intermediate results, and addresses. Instructions 
interact with the memory hierarchy via the LOAD and STORE 

(continued on page 14) 
Fig. 1. HP Precision Architecture VLSI CPU block diagram. 



12 HEWLETT-PACKARD JOURNAL SEPTEMBER 1987 



© Copr. 1949-1998 Hewlett-Packard Co. 



Execution Unit 



The execution unit (E-unit) is the computation engine of the 
CPU. As its name implies, the E-unit provides the CPU functions 
at the EXECUTE pipeline stage. Its primary function is the 
computation of results for the data transformation and branch 
instructions. Data transformation instructions act on operands selected 
from the general registers and produce results which are stored 
back into the general registers. Once a result has been 
computed, the E-unit must determine whether the condition for 
branching or nullification has been met. The E-unit contains 
special hardware which evaluates its result to determine the test 
condition. The E-unit hardware is outlined in Fig. 1. 

Data transformation instructions are of two types: arithmetic/ 
logical and shift/merge. The E-unit contains two specialized 
components to compute its results: the arithmetic logic unit (ALU) 
and the shift merge unit (SMU). 

The challenge in the ALU design is to provide quick execution 
of a 32-bit add function. This is the ALU function that requires 
the most gate delays. Many techniques exist to exploit parallelism 
to minimize gate delays for addition, but these tend to produce 
adder designs that are inordinately large. Since the same 
essential adder that serves the ALU is replicated on the CPU to provide 
the branch adder, program counter, and recovery counter, such 
techniques were largely excluded from this design. A relatively 
simple ALU organization was adopted, and NMOS circuit 
techniques are employed to provide the necessary speed. 

The adder is organized into eight identical 4-bit short carry 
stages which control an 8-bit-long carry chain. The intermediate 
carries and the operands are then combined by logic to produce 
the 32-bit result. Sixteen gate delays are required to transform 
the operands into a result by this method. Ordinarily, two levels 
of logic are required to propagate a carry. To speed up the carry 
propagation, an NMOS series carry chain is employed (Fig. 2). 
This scheme exploits the noninverting AND function provided by 
transmission gates, and replaces two levels of logic with a single 
transmission gate delay. The resulting adder is capable of 
producing its result within 10 ns of its receipt of operands. 
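
The organization described above, eight 4-bit stages steering an 8-bit-long carry chain, can be modeled in software as follows. This is a functional sketch only: it reproduces the grouping of the carries, not the gate-level or transmission-gate implementation, and the function name add32 and the generate/propagate formulation are assumptions chosen for illustration.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """32-bit add built from eight 4-bit groups and an 8-link carry chain.

    Returns (sum modulo 2**32, carry out of bit 31).
    """
    a &= MASK32
    b &= MASK32
    carries = [carry_in]  # carries[g] is the carry into 4-bit group g
    for g in range(8):
        x = (a >> (4 * g)) & 0xF
        y = (b >> (4 * g)) & 0xF
        generate = (x + y) > 0xF    # group creates a carry by itself
        propagate = (x + y) == 0xF  # group passes an incoming carry along
        carries.append(1 if generate or (propagate and carries[g]) else 0)
    result = 0
    for g in range(8):
        x = (a >> (4 * g)) & 0xF
        y = (b >> (4 * g)) & 0xF
        # Combine the operands with the intermediate carry for this group.
        result |= ((x + y + carries[g]) & 0xF) << (4 * g)
    return result, carries[8]
```

In this model, add32(0xFFFFFFFF, 1) propagates a carry through all eight links of the chain and returns (0, 1).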

Shift Merge Unit 

The SMU consists of three major components: a double-word 
shifter, a mask generator, and merger logic. The simplest SMU 
operations are the shift double instructions. The source 
operands are concatenated and then shifted right by a specified 
amount. The shifter is organized into five stages, each of which 
is hardwired to perform a partial shift. The first stage optionally 
shifts 16 bits, the second 8 bits, and so on. Using the shift stages 
in combination, shift amounts from 0 through 31 bits are 
accommodated. 
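
The five-stage shifter can be sketched as a sequence of optional partial shifts, one per bit of the shift amount. This is a functional model only; the function name shift_double and the choice to return the low 32 bits of the shifted 64-bit value are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def shift_double(high, low, amount):
    """Concatenate two 32-bit words, shift right, return the low 32 bits."""
    if not 0 <= amount <= 31:
        raise ValueError("shift amount must be 0 through 31")
    value = ((high & MASK32) << 32) | (low & MASK32)
    for stage_shift in (16, 8, 4, 2, 1):  # the five hardwired stages
        if amount & stage_shift:          # each stage shifts or passes through
            value >>= stage_shift
    return value & MASK32
```

For example, shift_double(0x1, 0x0, 4) enables only the 4-bit stage and returns 0x10000000.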

The EXTRACT instructions extract a specified field from a single 
operand, and place it right justified in the result. These use the 
shifter to perform the justification of the operand. The unused 
bits of the result are filled with zeros, or optionally sign-extended 
(filled with the most significant bit of the extracted field). The 
mask generator is used to distinguish zero or sign-filled bits of 
the result from those containing the extracted value. The merger 
logic then produces the result from the mask and shifter output. 

DEPOSIT instructions make up the remainder of the SMU 
functions. Essentially, DEPOSIT is the inverse of the EXTRACT 
operation: a right justified field of a specified length is merged into the 
result register at a specified position. Again the shifter is used 
to position the specified field into the result. For deposit, a mask 
is produced which distinguishes the deposited field from the 
unaffected portion. The merger logic then assembles the result 
from the shifter output and the result target (which is itself an 
operand for DEPOSIT) according to the mask. 
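
The roles of the shifter, mask generator, and merger logic in EXTRACT and DEPOSIT can be sketched as follows. This is a simplified model: the field position is given here as the offset of the field's least significant bit counted from the right, which is a convention chosen for the sketch and not the instruction encoding, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def extract(src, pos, length, signed=False):
    """Right-justify the length-bit field of src whose LSB is at bit pos."""
    shifted = (src & MASK32) >> pos  # shifter: justify the field
    mask = (1 << length) - 1         # mask generator: ones over the field
    field = shifted & mask           # merger: keep only the field bits
    if signed and (field >> (length - 1)) & 1:
        field |= MASK32 & ~mask      # fill unused bits with the sign bit
    return field


def deposit(target, field, pos, length):
    """Merge the right-justified field into target at bit position pos."""
    mask = ((1 << length) - 1) << pos  # ones over the deposited field
    shifted = (field << pos) & MASK32  # shifter: position the field
    return (shifted & mask) | (target & ~mask & MASK32)
```

With these definitions, extract(0xABCD1234, 8, 8) returns 0x12 and deposit(0xFFFFFFFF, 0x00, 8, 8) returns 0xFFFF00FF, leaving the bits outside the mask untouched.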

The ALU and the SMU share a common result bus which 
carries their results back to the general registers. Once a result 
is computed, it is evaluated by the test condition hardware. In 
general, 14 nontrivial test conditions may be specified. To speed 
up the evaluation, all 14 conditions are computed in parallel. The 
condition specified in the instruction format has been decoded 
in parallel with the result computations (the code for the condition 
has been available since the instruction decode pipeline 
stage). The proper condition is selected via a multiplexer circuit, 
and the E-unit has completed its operation. 40 ns has elapsed 
since the E-unit first received the operands. 

Conditional branch instructions also rely on the E-unit's 
operation to determine if a branch should be taken. The result of a 
conditional branch computation is used to determine the branch 
condition. For ADD AND BRANCH instructions, this result is also 
stored into the general registers. For other branch instructions, 
the test condition is computed and the result is discarded. 
(continued from page 12) 

instructions. These instructions have address modification 
mechanisms which allow code to work on data structures 
efficiently. 

The architecture provides an extensive set of instructions 
for arithmetic, logical, and field manipulation operations. 
Primitive operations serve as building blocks for complex 
operations that cannot execute in a single cycle, such as 
decimal math and integer multiply and divide. Instructions 
are also provided to control coprocessors, which perform 
more complex operations such as floating-point math. 
Coprocessors can perform their operations in parallel with 
CPU program execution until data dependencies force an 
interlock. 

The architecture has both conditional and unconditional 
branches. All branch instructions have the delayed branch 
feature. This means that the instruction following the 
branch is executed before the target of the branch is 
executed. This feature reduces the penalty normally associated 
with branches in pipelined CPUs. A BRANCH AND LINK in- 
struction is provided to support subroutine calls. This saves 
the return address of the calling routine in a general register 
before control is transferred. 

The architecture also defines a control feature called 
nullification. All branch instructions and data transformation 
instructions can conditionally nullify the next executed 
instruction. When an instruction is nullified, it executes 
as a NOP. The effect is logically the same as a skip instruc- 
tion, except that no branch is required and the CPU pipeline 
is spared the branch overhead. The architecture also 
specifies a set of 25 different types of interruptions which 
cause control to be transferred to interrupt handling 
routines. Upon completion of interruption processing, a 
RETURN FROM INTERRUPT instruction returns control to the 
interrupted routine. 

CPU Chip Overview 

The major functional blocks of the CPU are shown in 
Fig. 1 on page 12 and outlined in the chip photomicrograph, 
Fig. 2. The data path circuitry consists of the functional 
units required for instruction fetching, operand generation, 
and instruction execution. All components of the data path 
are 32 bits wide. 

The ALU, test condition block, and shift-merge unit make 
up the execution unit. The ALU is used to execute the 
arithmetic and logical instructions, compute the addresses 
of the LOAD and STORE instructions, and compute target 
addresses for general register relative branches. In addition, 
the ALU computes the result to be used in the condition 
evaluation for most conditional branches. The shift-merge 
unit is used to execute the set of SHIFT, EXTRACT, and 
DEPOSIT instructions and to compute the result to be used in 
the condition evaluation for the BRANCH ON BIT instruction. 
The test condition block operates on the result generated 
by the ALU or the shift-merge unit to determine if the 
condition for branching or nullification is true. 

The data path blocks for operand generation are the 32 
general registers and the immediate operand assembler. 
The immediate operand assembler assembles 32-bit operands 
for the ALU from the immediate fields in the instruction. 

The data path blocks for instruction fetching include the 
program counter and the branch adder. These blocks 
compute the addresses for sequential and branch target 
instruction fetches, respectively. Also included in this section is 
a set of registers to keep track of the addresses and 
instructions currently in the pipeline. 

There is a set of control registers that have a variety of 
specialized uses. Some are accessed only by software via 
MOVE TO/FROM CONTROL REGISTER instructions. Some 
control registers are also controlled implicitly by the CPU 
hardware. These include the instruction address queue, 
the interruption parameter registers, the recovery counter, 



Fig. 2. Photomicrograph of CPU chip. 
and the interruption processor status word. Other control 
registers control the generation and servicing of traps and 
interrupts. The processor status word affects trap 
generation and other control functions. 

The trap logic receives interruption requests from various 
sources both internal and external to the CPU. Its function 
is to determine which (if any) of the 25 traps are valid and 
enabled, and generate a vector address for the 
highest-priority interruption. It then instructs the CPU to branch to that 
address, saving the return address in a control register. 
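
The selection performed by the trap logic amounts to a priority encoder followed by a vector address calculation. The sketch below illustrates the idea only. It assumes that lower-numbered traps have higher priority and that vectors are evenly spaced from a base address; neither assumption is stated in this article, and the names select_trap and vector_address and the 32-byte default spacing are hypothetical.

```python
NUM_TRAPS = 25


def select_trap(pending, enabled):
    """Pick the highest-priority trap that is both pending and enabled.

    pending and enabled are bit masks with bit n standing for trap n.
    Returns the trap number, or None if no trap is to be taken.
    Assumption: a lower trap number means a higher priority.
    """
    active = pending & enabled & ((1 << NUM_TRAPS) - 1)
    if active == 0:
        return None
    lowest_set_bit = active & -active  # isolate the lowest-numbered trap
    return lowest_set_bit.bit_length() - 1


def vector_address(base, trap, spacing=32):
    """Vector address for a trap, assuming evenly spaced vectors."""
    return base + spacing * trap
```

With traps 2 and 4 pending and all traps enabled, select_trap(0b10100, 0b11111) returns 2; if trap 2 is masked off, trap 4 is selected instead.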

The cache bus interface block consists of 32 address pads, 
32 data pads, 57 control pads, and the logic required to 
control them. This interface is used to communicate with 
off-chip cache, TLB, system interface unit (SIU), and 
coprocessor chips. Normally, the CPU is the master of the 
cache bus. An external controller (the SIU) handles copy-in 
and copy-back traffic between the cache and main memory, 
as well as multiprocessor cache coherency algorithms (see 
article, page 4). 

Two-phase clock generator circuits capable of driving 
500-pF loads with 3-ns rise and fall times are included on 
the chip. Together, the clock buffers consume nearly 10% 
of the CPU's available silicon area. 

The chip is controlled by a number of programmable 
logic arrays (PLAs) and a small amount of custom-designed 
logic. Three large PLAs control the functions of instruction 
sequencing and decoding, and a fourth PLA aids the CPU 
in the control of the cache bus interface. Because of the 
critical timing at this interface, much of its control is 
delegated to specialized hand-crafted logic. 

Testing of the chip is accomplished through a serial diag- 
nostic interface port (DIP). The DIP allows serial shifting 
of eleven internal scan paths, which access 1,366 bits of 
internal CPU state. The test logic controls the on-chip scan 
paths and interfaces to an external tester for serial testing 
of the chip. The details of DIP operation and the test 
capabilities it provides are described in "VLSI Test 
Methodology" on page 24. 

Instruction Sequencing and Pipeline Performance 

The CPU pipelines the fetching and execution of all 
instructions. This allows execution of different stages in the 
pipeline to occur in parallel, thereby reducing the effective 
execution time of each instruction. Fig. 3a depicts the 
pipelined execution of three sequential instructions. One 
instruction is executed every two clock periods, resulting 
in a machine cycle time of 66 ns. This allows a peak 
performance rating of 15 MIPS (million instructions per 
second). The bandwidth of the cache bus allows one 
instruction fetch and one data access to be completed each 
machine cycle. 
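
The peak figures follow directly from the clock frequency. The short calculation below assumes the 30-MHz clock quoted elsewhere in this article and two clock periods per machine cycle; the function names are hypothetical.

```python
def machine_cycle_ns(clock_mhz, clocks_per_cycle=2):
    """Machine cycle time in nanoseconds."""
    return clocks_per_cycle * 1000.0 / clock_mhz


def peak_mips(clock_mhz, clocks_per_cycle=2):
    """Peak rate when one instruction completes every machine cycle."""
    return clock_mhz / clocks_per_cycle


if __name__ == "__main__":
    print(f"machine cycle: {machine_cycle_ns(30):.1f} ns")
    print(f"peak rate:     {peak_mips(30):.1f} MIPS")
```

At 30 MHz this gives a machine cycle of about 66.7 ns, which the text quotes as 66 ns, and a peak rate of 15 MIPS.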

The pipeline stages executed by each instruction (see 
Fig. 3a) include instruction fetch (FETCH), instruction 
decode/operand select (DECODE), ALU operation/data address 
(EXECUTE), and data access (DATA). An additional one-half 
clock period not shown in Fig. 3a is required to complete 
a write to the general registers. Execution of most 
instructions follows this basic pattern. This allows most 
instructions to execute in a pipelined manner at the peak rate. 
Barring pipeline interlocks, 93 percent of all instructions 
in the set will execute in a single machine cycle. Those 
that require additional cycles include system control 
instructions for cache/TLB maintenance, system diagnostics, 
and interrupt handling. 

Instruction address generation is done before the FETCH 
stage. The address is generated by the program counter for 
sequential fetches, or by the branch adder or ALU for 
branch target fetches. Other sources of instruction address 
include the trap logic and the instruction address queue 
for vectoring to and returning from an interruption routine. 
In all cases the instruction address is issued to the cache 
bus on the phase before the FETCH stage. 

Sequential and branch target addresses, however, are ac- 
tually issued by the cache controller to the cache memory 
array before the phase preceding the FETCH stage. This 
allows for a pipelined two-clock-period cache access. 
Sequential instructions are prefetched by the cache controller 
which maintains a second copy of the program counter. 
The sequential instruction address is actually generated 
three clock periods before the FETCH stage. This allows 
sequential instructions to be prefetched one machine cycle 
before they are sent to the CPU. Thus, sequential instruction 
fetches are completed every machine cycle without a 
pipeline penalty. 

The execution of a branch instruction is shown in Fig. 
3b. The branch instruction uses its data cycle on the cache 
bus to issue the branch target address. This is two clock 
periods before the FETCH stage of the branch target instruc- 
tion. Thus, branch target fetches are also pipelined two- 
clock-period cache accesses. The cache controller receives 
the computed branch condition from the CPU and uses it 
to select between the prefetched sequential instruction and 
the branch target instruction returning from the cache. 
Hence, taken and untaken branches execute at the peak 
rate without a pipeline penalty. 

On the first phase of the FETCH stage, the instruction and 
its corresponding TLB miss and protection check flags are 
driven to the CPU on the cache bus. On the second phase 
of the FETCH stage the instruction is driven into the chip 
and set into PLA input latches and the instruction register. 
During the FETCH stage, the cache controller prefetches the 
next sequential instruction. 

Instruction decoding and operand selection are done 
during the DECODE stage. During the first phase of DECODE, 
the PLAs decode the instruction. On the second phase of 
DECODE, control lines fire to send the general register 
operands to the execution unit (ALU and shift-merge unit). 
Also on this phase, the immediate operand assembler sends 
immediate operands to the execution unit. 

The execution unit and the branch adder produce their 
results during the EXECUTE stage. On the first clock phase 
of the EXECUTE stage a data address or branch address is 
valid on the internal address bus. This address is latched 
at the address pad driver and driven to the cache bus at 
the beginning of the next clock phase. For conditional 
branches, the execution unit does a calculation in parallel 
with the branch target address calculation by the branch 
adder. The execution unit does a compare, add, shift, or 
move operation for conditional branches. The result is then 
tested by the test condition block to determine whether 
the branch is taken (see "Execution Unit," page 13). 

Accesses to the cache bus initiated during the EXECUTE 
stage are completed during the DATA stage. These occur for 
loads, stores, semaphore operations, coprocessor opera- 
tions, and cache and TLB maintenance instructions. Store 
data is driven to the cache bus on the first phase of the 
DATA stage. Load data and TLB and coprocessor trap 
information are received by the CPU by the end of the first phase 
of DATA. Load data is driven to the CPU data path on the 
second phase of DATA. Control registers are also set in this 
state. Load data is set into a general register on the next 
phase if a trap did not occur. 

Execution of LOAD and STORE instructions results in a
degradation of performance because of practical limits on
the access times that can be obtained with large external
cache memories. As shown in Fig. 3c, an additional half
machine cycle is inserted into the pipeline to allow sufficient
time for the cache memory to complete its access.
Additional penalties of one machine cycle are incurred for
using data that was loaded in the previous instruction, and
for nullification of an instruction.

The total performance degradation incurred from the instruction
pipeline (assuming no cache or TLB misses) can
be calculated by summing the products of the LOAD, LOAD/USE,
STORE, and NULLIFY penalties and their respective frequencies
of occurrence. For typical multitasking workloads
the average degradation is 0.39 CPI (cycles per instruction),
resulting in sustained pipeline performance of 10.8 MIPS.
The sustained performance can be increased using an optimizing
compiler. All interruptions and degradations to
normal pipeline execution are handled completely by CPU
hardware in a manner that makes the pipeline transparent
to software.
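
The sum-of-products calculation described above can be sketched in a few lines of Python. This is only an illustration: the per-instruction frequencies in the example table are hypothetical placeholders, not measured workload data, and the half-cycle STORE penalty is an assumption based on the half-cycle cache access penalty mentioned earlier. The peak rate of 15 MIPS is likewise an assumption, inferred from the fact that 10.8 MIPS at 1.39 cycles per instruction implies roughly 15 million machine cycles per second.

```python
def degradation_cpi(events):
    """Sum penalty x frequency over all pipeline penalty events.

    events maps an event name to a pair:
    (penalty in machine cycles, occurrences per instruction).
    """
    return sum(penalty * freq for penalty, freq in events.values())


def sustained_mips(peak_mips, degradation):
    """Sustained rate when each instruction costs 1 + degradation cycles."""
    return peak_mips / (1.0 + degradation)


# Hypothetical frequencies, chosen only to exercise the arithmetic.
example = {
    "LOAD": (0.5, 0.20),      # half-cycle cache access penalty
    "LOAD/USE": (1.0, 0.05),  # loaded data used by the next instruction
    "STORE": (0.5, 0.10),     # assumed to match the LOAD penalty
    "NULLIFY": (1.0, 0.04),   # nullified instruction
}

print(round(degradation_cpi(example), 2))     # illustrative degradation
print(round(sustained_mips(15.0, 0.39), 1))   # 10.8, using the quoted 0.39 CPI
```

With the quoted average degradation of 0.39 CPI and the assumed 15-MIPS peak, the second line reproduces the 10.8 MIPS figure given in the text.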

System performance is a function of the operating sys- 
tem, user workload, main memory design, and cache and 
TLB size and organization. The CPU pipeline and cache 
implementations are designed to maximize sustained sys- 
tem performance for multiuser computing workloads in a 
system with a manufacturable and cost-effective large main
memory subsystem. In addition, the on-chip 30-MHz synchronous
cache bus interface places no constraints on external
cache and TLB sizes and organizations. Different
main memory systems can also be used with this CPU. The
HP 3000 Series 950 (HP 9000 Model 850S) and HP 9000
Model 825 Computers use different main memory, cache,
and TLB systems for different price/performance trade-offs.

A Precision Clocking System



In high-speed synchronous systems, requirements for fast interchip
communication are very restrictive. System frequency
can be limited by driver speeds, signal propagation time, and
chip-to-chip clock skew. System-wide clock skew presents a
particularly difficult dilemma. To accommodate the high clock
loading requirements of the NMOS chips, a high-gain multistage
buffer must be used. Although it is possible to integrate such a
buffer system efficiently onto silicon, the buffer delay on each
chip, and therefore the system clock skew, can vary widely over
the range of possible operating parameters such as supply voltage
and temperature. Subtle variations in the NMOS-III manufacturing
process compound the problem. It would seem that the
high drive requirement is at odds with a low-skew system clock
distribution.

To reduce chip-to-chip skew, all chips in our processor system
have a local phase-locked clock generator. This circuit ensures
that the clock buffers on each chip have a matched delay to
within 3 ns over their specified range of parameters.

This circuit is not a phase-locked loop. It does not have a
voltage-controlled oscillator. Instead, its operation is based upon
a delay element and a comparator. The delay element (Fig. 1a)
consists of a capacitor, a precharger, and a pulldown FET. During
the precharge cycle, the capacitor is precharged to V^l. When
the START signal goes high, the capacitor discharges through
the two series FETs. The rate of discharge is controlled by the
voltage VCONTROL. The lower VCONTROL, the lower the gate drive
and the longer the delay. Higher VCONTROL produces a shorter
delay.

The control voltage is generated by the comparator circuit
(Fig. 1b). Its purpose is to move the VCONTROL signal until the
SYNC and CK1 (clock 1) edges coincide. If CK1 occurs late, the
drain bootstrap circuit that fires DIPUP is allowed to fire. This
produces a pulse that allows some or all of the charge on the
small dipup capacitor to be shared between the small dipup
capacitor and the big VCONTROL capacitor. The voltage VCONTROL
will rise and the delay in the delay element will decrease. This
means CK1 will occur sooner. If CK1 occurs too soon, the DIPUP
circuit will be inactive, and the DIPDN node will pulse. Thus the
charge stored on the big VCONTROL capacitor will equalize with
the small dipdown capacitor, and the VCONTROL signal will decrease
slightly, increasing the delay in the delay element. This
circuit is essentially a low-pass switched-capacitor filter.
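
The charge-sharing control loop can be illustrated with a small behavioral model in Python. This is a sketch under stated assumptions, not a description of the actual circuit: the capacitor ratio, the voltage levels, and the inverse relationship between VCONTROL and delay are all hypothetical values chosen to show how repeated small up and down corrections settle the delay near a target.

```python
def share_charge(v_big, c_big, v_small, c_small):
    """Voltage after two capacitors are connected and their charge equalizes."""
    return (c_big * v_big + c_small * v_small) / (c_big + c_small)


def delay_ns(v_control, k=20.0):
    """Hypothetical delay model: higher control voltage gives a shorter delay."""
    return k / v_control


def settle(target_ns, v_control=2.0, steps=2000,
           c_big=100.0, c_small=1.0, v_high=5.0, v_low=0.0):
    """Run the comparator once per clock and return the final control voltage."""
    for _ in range(steps):
        if delay_ns(v_control) > target_ns:
            # CK1 is late: the DIPUP pulse shares charge from a small
            # capacitor at v_high, nudging the control voltage up.
            v_control = share_charge(v_control, c_big, v_high, c_small)
        else:
            # CK1 is early: the DIPDN pulse shares charge with a small
            # discharged capacitor, nudging the control voltage down.
            v_control = share_charge(v_control, c_big, v_low, c_small)
    return v_control


v_final = settle(target_ns=8.0)
print(round(v_final, 2), round(delay_ns(v_final), 2))
```

Because the small capacitor is much smaller than the VCONTROL capacitor in this model, each correction moves the control voltage only slightly, so the loop ends up dithering within a narrow band around the target delay. That is the low-pass behavior the sidebar describes.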

Other Features 

Chip capacitive drive is scalable. Each chip has enough output
buffers to meet its clock requirements. Each buffer block can
drive 76 pF and contains circuitry to reduce ringing caused by
path inductance.

Power-up and power-down features are included. The clocks
are held low until a circuit detects that the loops are up and
stable. Then they are released. This ensures that all chips in a
system only have clocks when their loops are correctly placing
edges relative to the SYNC signal. This eliminates drive contention
(see article, page 4).

Summary 

The design effort has resulted in a single VLSI component
that implements the entire instruction set of a next-generation
computer architecture. This CPU can be used in a
variety of products spanning a broad range of price and
performance by appropriately configuring external cache
memory, TLB, main memory, and coprocessors. In addition,
high-performance multiprocessor systems can be built
using this CPU.

The NMOS-III VLSI process was chosen to implement
this design because of its proven history of high yield and
reliability. Manufacturability is enhanced by a worst-case
design methodology which ensures reliable operation over
the full range of process variation, 0°C-to-110°C junction
temperature range, ±10% power supply variation, and
worst-case variation of all external components.

Performance was not sacrificed to achieve manufactura- 
bility and flexibility. In addition to implementing an in- 
struction set engineered for high throughput and compati- 
bility with next-generation operating systems and compil- 
ers, the CPU employs techniques to minimize instruction 
execution time. Pipelined execution, instruction prefetching,
branch target prefetching, multiple internal data paths,
and a high-bandwidth external cache interface are some of
the mechanisms used to minimize the number of machine
cycles required to execute each instruction. A low-skew
clocking system coupled with special attention to circuit
optimization in the critical timing paths results in a
minimum operating frequency of 30 MHz under worst-case
conditions.

The design has been proven through extensive functional
and electrical characterization at the chip, board, and system
levels. Performance has been verified under multiple
operating systems running real user workloads.

Design, Verification, and Test
Methodology for a VLSI Chip Set

by Charles Kohlhardt, Tony W. Gaddis, Daniel L. Halperin, Stephen R. Undy, and Robert A. Schuchard



TEN CUSTOM VLSI CHIPS are used in the processors
of the HP 3000 Series 950 and HP 9000 Models 850S
and 825 Computers. The complexity of the design,
integration, and testing required to deliver ten VLSI chips
that meet the functional and electrical requirements for
these products demanded close attention to detail throughout
the development program.

The strategy used was to set the expectation that the
initial lab prototype system built with revision 1.0 silicon
was to provide a high level of functionality and electrical
quality. A high level of quality implied that the operating
systems would boot and that there would be no first-order
bugs that would preclude characterization beyond them.
With this goal met for revision 1.0 silicon, electrical and
functional characterization were to proceed covering all
aspects of individual chips, boards, and systems. With the
lab prototype characterization complete, the chips would
then be released for production prototyping with the expectation
that the chips would be production quality.

Tactics to accomplish this strategy were then defined.
The first aspect was how to deliver first silicon that would
meet the lab prototype quality level. This was the design
methodology. The lab prototype evaluation tactics were to
define feedback paths required for complete functional and
electrical characterization of the chips, boards, and systems.
The production prototyping evaluation tactics were
to repeat those feedback paths where changes had been
made or risks of introducing a new bug were present.

Within the design methodology, heavy emphasis was 
placed on both worst-case circuit design and functional
verification. Functional verification included both FET 
switch level simulations and behavioral level simulations 
at the boundaries of each chip. System level models were 
constructed and behavioral level simulations were per- 
formed to verify some of the first-order chip-to-chip trans- 
actions. This design methodology will be elaborated on in 
a later section of this article. 

With the first release of the VLSI designs for fabrication, 
the characterization feedback paths were defined. Project 
teams were assigned to these feedback paths with responsibility
for definition of the plans, development of any
special tools, and execution of the plans. We will elaborate
on three of the specific feedback paths that were fundamental
to the program success.

One of the primary feedback paths was associated with 
the functional verification of the lab prototype hardware. 
Since the operating system development was staffed in
parallel with hardware development, the hardware design
team internalized a significant portion of the functional
verification. This allowed the operating system development
to proceed relatively independently of the hardware
verification. Since hardware/software certification was late
in the program schedule, having early functional verification
was essential. In addition, special attention could be
placed on corner case (rare event) testing, where operating
system tests are written and executed with a different objective.
A special operating system called "Khaos" was
developed, along with 87K lines of code which stressed
the corner cases of this chip set. The tools and verification
results will be discussed in more detail in a later section
of this article.

Two other areas of feedback were the electrical charac- 
terization of the individual chips and the integration and 
characterization of the VLSI at the processor board level.
A scan path test methodology was developed which was
applicable to both the chip and board level characteriza- 
tions. The concepts of that methodology and the results of 
its application will be presented. 

Electrical Design Methodology 

One of the key aspects of the design methodology for the
chip set was the ability to design for performance. Achieving
high system clock rates was an important contribution
of the design effort. There are three major facets of the
methodology that address high-performance design: structured
custom design approach, worst-case design and simulation,
and modeling of the chip set in the package and
board environment.

The typical chip has a regular structure. This is a result
of the use of separate data path and control structures to 
implement the chip's functionality. The data path consists 
of registers, adders, comparators, parity trees, and other 
regular blocks. The control structures are implemented as 
PLAs (programmed logic arrays). This approach is typically 
referred to as structured custom design. 

Each block of the data path, such as a register or adder,
is designed on the basis of a single bit and then is repeated
to form a 32-bit-wide block. The global busing passes over
the top of each "bit slice" and is an integral part of the
structure. Multiple instances of the registers and other
blocks are then combined by abutment into the register
stack. This arrangement of blocks gives a high packing
density within the data path.
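
The bit-slice idea can be illustrated in software: one single-bit cell is defined once and then repeated 32 times to form the full-width block. The following Python sketch uses a ripple-carry adder purely as an assumption for illustration. The article does not say which adder organization the chips use, and a real high-performance design would likely use a faster carry scheme.

```python
def full_adder_slice(a, b, carry_in):
    """One bit slice: returns (sum bit, carry out) for single-bit inputs."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def add_32(x, y, carry_in=0):
    """Repeat the same slice 32 times, chaining the carry between slices."""
    result, carry = 0, carry_in
    for bit in range(32):
        sum_bit, carry = full_adder_slice((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= sum_bit << bit
    return result, carry


print(add_32(5, 7))            # (12, 0)
print(add_32(0xFFFFFFFF, 1))   # (0, 1): wraps around with carry out set
```

The point of the sketch is that only `full_adder_slice` needs to be designed and verified in detail; the 32-bit block follows by repetition, which mirrors how a single laid-out slice is stepped to build the data path.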

The PLA structure makes it possible to implement random
logic and state machines in a regular fashion. The
control of each chip is specified in a high-level language 
and is converted into the PLA structure. The control struc- 
tures were refined and debugged throughout the design 
phase of the chips and were typically the last structures 
to change on the chips. This process allowed the PLA struc- 
tures to change quickly and provided a dense implementa- 
tion of the control logic. 
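
A PLA evaluates logic as a sum of products: an AND plane forms product terms from the inputs, and an OR plane combines selected terms into each output. The Python sketch below shows that structure in a minimal form. The table format, the signal names, and the example function are all hypothetical and are not taken from the high-level control language mentioned above.

```python
def pla_eval(product_terms, or_plane, inputs):
    """Evaluate a PLA described as two tables.

    product_terms: list of dicts mapping an input name to the bit it must
                   have for that term to fire (inputs not listed are don't-cares).
    or_plane:      dict mapping each output name to the indexes of the
                   product terms that drive it.
    inputs:        dict mapping each input name to 0 or 1.
    """
    fired = [all(inputs[name] == bit for name, bit in term.items())
             for term in product_terms]
    return {out: int(any(fired[i] for i in terms))
            for out, terms in or_plane.items()}


# Hypothetical example: a half adder expressed as three product terms.
terms = [{"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
outputs = {"sum": [0, 1], "carry": [2]}

print(pla_eval(terms, outputs, {"a": 1, "b": 0}))   # {'sum': 1, 'carry': 0}
print(pla_eval(terms, outputs, {"a": 1, "b": 1}))   # {'sum': 0, 'carry': 1}
```

Changing the control logic in this representation means editing rows in the two tables rather than redesigning gates, which is one way to see why PLA-based control could change quickly late in the design.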

These regular design structures in the data path and control
logic helped maximize density and made it possible
to leverage blocks in cases of multiple instances of a single 
block in the data path. They also offered the capability to 
design for high performance. With such a regular structure, 
it was possible to set accurate timing budgets for the critical 
paths and design the blocks of the data path and the PLAs 
to meet the timing requirements. Other design styles, such 
as gate arrays and standard cells, do not offer these benefits 
since they do not have the same regular structure. By man- 
aging the timing on a block level, maximum performance 
was obtained.

To guarantee that maximum performance is maintained
over all operating extremes, it is necessary to use a worst-case
electrical simulation methodology in addition to the
structured custom approach. The design of the blocks typically
started as a schematic which represented the circuit
before the artwork design. In addition to the MOS transistors,
capacitors and resistors were included to estimate the
parasitic effects of the artwork. The schematic was then
electrically simulated, and the sizes of the devices were
adjusted to meet the timing budgets. The circuit was simulated
using worst-case models of the devices based on IC
process spread, temperature extremes, voltage margins, and
noise effects.

Once the designer was confident in the design, artwork
was created for the circuit. The worst-case capacitance for
the artwork was extracted and substituted back into the 
circuit, and electrical simulations were rerun to verify the 
performance of the artwork. This resulted in blocks for the 
data path that met performance requirements over the en- 
tire range of operation. 

The worst-case methodology was effective for the internal
blocks of the chip. The pad drivers and receivers were
given an equal level of attention.

Package environment effects include the inductance of
the bond wires and the noise on the power supply planes 
caused by the high-current spikes of driver switching. The 
board environment behaves like a transmission line at the 
frequency of operation. This environment required con- 
struction of a complex model of the chip, package, and 
board. This model was simulated using the worst-case con- 
ditions described above. The effects were accurately pre- 
dicted and the bus designs met their design specifications. 

The structured design approach, worst-case design and
simulation methodology, and careful simulation of the system
environment were very effective. The first releases of
the chips worked at the target frequency in the system with 
adequate margins using typical parts. 

Design Verification Methodology

When a system consisting of ten chips is designed, serious
functional bugs will adversely affect the integration
schedule. The process of fixing a bug, releasing the chip,
and fabrication takes on the order of two months. If several
of the chips had had functional bugs that precluded product 
integration, then the schedule to having working systems 
would have slipped drastically. Therefore, the goal was set 
to have first-revision chips that met all of the requirements 
of the lab prototype system including booting the operating 
systems. Since the ten chips represent ten individual func- 
tional algorithms and there are four major bus protocols 
connecting the chips, the verification problem presented 
no small challenge. 

To meet the goal, behavioral models of the chips were
written to describe their functionality. These models used 
the same high-level schematics that were being used to 
construct the chips. For each chip, the same source that 
was used to generate the control blocks described above 
was used to generate the description of the control logic 
for the model. Behavioral descriptions were written for all 
of the other blocks based on the schematics and functional 
descriptions. By writing a behavioral model it was possible 
to have a model for each chip long before the artwork for
the chip existed. The model also made it possible to connect
all of the chips into a larger system model.

Because of the difficulty of generating all of the corner
cases for a system model including all of the chips, a significant
amount of simulation time was spent verifying
individual chips. A high-level language was used to specify
the corner cases and was compiled into the input source
code for the functional simulator. The high-level compilers
were Pascal programs written by the verification teams to
increase their productivity in generating tests. To ensure
good coverage, numerous reviews were held of the corner
cases that were executed.

The chip models were put into a complete system model
as soon as they were capable of executing their basic functions.
The test cases for the system model were written in
HP Precision Architecture assembly language. A source of
tests was the set of architecture verification programs written
at HP Laboratories as part of the architecture definition.
These tests covered the functionality of the instruction set.
In addition to these tests, other code was needed to provide
coverage for areas specific to this implementation, such as
the CPU instruction pipeline, the bus protocols, and error
handling.

To ensure that the artwork would match the behavioral 
model, the test cases and results were extracted from the 
behavioral simulator and run on a switch-level simulator. 
This simulator represented the transistors as switches, with
each switch given a strength rating for drive contention
resolution. For each chip, the artwork was used to extract
the transistor network for the switch model. When this
step of the testing was completed, it was possible to guaran- 
tee that the chips would function exactly like the behavioral 
models. 
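
The role of the strength rating in a switch-level simulator can be sketched as follows. When several conducting switches drive the same node, the strongest driver wins, and a tie between opposing values is reported as unknown. This Python fragment is a generic illustration of that rule. The strength scale and the "X" and "Z" symbols are conventional assumptions and are not taken from the simulator used on this project.

```python
def resolve_node(drivers):
    """Resolve one node from the switches currently driving it.

    drivers: list of (value, strength) pairs for conducting switches,
             where value is 0 or 1 and a larger strength overrides a
             smaller one.
    Returns 0 or 1 for a clear winner, "X" for unresolved contention
    between equally strong drivers, or "Z" when nothing drives the node.
    """
    if not drivers:
        return "Z"
    strongest = max(strength for _, strength in drivers)
    winners = {value for value, strength in drivers if strength == strongest}
    return winners.pop() if len(winners) == 1 else "X"


# A weak pullup (strength 1) against a strong pulldown (strength 3),
# as in ratioed NMOS logic: the pulldown wins.
print(resolve_node([(1, 1), (0, 3)]))   # 0
print(resolve_node([(1, 2), (0, 2)]))   # X
print(resolve_node([]))                 # Z
```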

In the operating system turn-on and reliability testing,
only five functional bugs were found, and these required
only minor workarounds to complete the testing effort. Fig.
1 is a summary of the resources needed for three of the
chips in the system. As can be seen, the effort required for
verification was substantial. The resulting accelerated
schedule for system integration and time to manufacturing 
release of the chips more than paid for the time invested. 

Hardware Functional Characterization 

There were several functional areas where it was impos- 
sible to provide extensive coverage during the design phase 
because of the speed limitations of the simulators. Instruc- 
tion sequences and system exceptions such as cache misses 
and traps are examples where the combinatorial possibilities
were too high to cover all corner cases in simulation.
These cases typically would not be tested until substantial
operating system qualification was underway. The
schedule made it unacceptable to wait until operating system
testing was completed, and this testing would not cover
cases such as error recovery in the system, so a different 
approach was required to ensure that the chip set met man- 
ufacturing release quality. 

The coverage goal for manufacturing release required a 
methodology that would accelerate multiple occurrences 
of corner case events in the system in much the same way 
that an operating system would do under heavy job loading
conditions. To do this, a test operating system, called
Khaos, was defined and written. This system consists of a
set of multiple-priority scheduling queues and a set of trap
handlers that were used by code to handle exception cases
in a controlled fashion. With Khaos, test suites could be
compiled out of specific test programs and the queues managed
to regulate the interaction of the programs to ensure
random, thorough coverage of the events of interest. Khaos
also provides observability functions for debugging test
code and diagnosing hardware errors. These functions were
supplemented by the use of logic analyzers. 

To test the processor board, the architecture verification 
programs mentioned above were used, in addition to other 
code. The code was all written in assembly language to
ensure the control needed of the features of the architecture.
One group of code that was written exhaustively covered
areas such as the code and data protection mechanisms
and traps on all instruction classes. Another major group
of code was referred to as the "thrashers." These pieces of
code went into the cache and TLB subsystem and changed
their states so that there would be a high level of misses
and traps during the execution of other processes. Still
another group of code covered the error recovery capabilities
of the cache and TLB subsystem. The code used diagnostic
instructions defined for this implementation to seed
errors which would appear during the execution of other
programs. The programs checked to ensure that error cases
were handled properly and that execution resumed normally.

To test the I/O and memory subsystems, two major groups 
of code were written. One group of code was used to create 
a high level of traffic through the subsystems and exhaustively
execute the associated protocol. HP Precision Architecture
assembly code was again used to control the
architectural features. The C drivers from the operating
system were leveraged to test the I/O channel. The other
group of code seeded errors, in addition to using a protocol
verification hardware board that interfaced with the bus.

These individual programs were combined into test
suites and statistics were gathered to determine a run time
for high-confidence coverage of the cases that were being
tested. A statistical model was used and most of the suites
were run for a period of 8 to 14 hours on the lab prototype
hardware. The results were very successful. All of the bugs
found during operating system testing were also found by
the verification effort. In addition, corner case bugs involving
error recovery were located which would not have been
uncovered by the operating system qualification effort. In
fact, one of the last bugs found in error recovery was a
result of the interaction and sequence of six distinct events.
After all of the other characterization was completed on
the lab prototype and the chip set was released with bug
fixes, the test suites were reexecuted on the production
prototype systems. The results of this testing did demonstrate
that the chip set was of manufacturing release quality.
The entire testing effort required 110 engineering months
and 87 thousand lines of code. The result greatly shortened
the time to reach manufacturing release and provided high
test coverage of features.

Fig. 1. System design statistics and resources for three of
the VLSI chips.

Electrical Characterization

As part of the design of these chips a comprehensive test
methodology was developed. This included internal chip
components as well as external test hardware and software.
The essence of the methodology is the use of serial scan
paths. The scan paths are placed at the chip I/O pads and
at several other places internal to the chip. This
methodology has been applied at the wafer, package, board,
and system levels. For a description of the methodology,
the hardware, and the software, see "VLSI Test Methodology,"
page 24.

A wafer and package tester was designed for chip characterization
and production testing. A board tester was designed
for initial turn-on, integration of the subassemblies,
and board level characterization. System testers were also
developed for initial system turn-on in the absence of a
complete operating system and I/O assemblies.

The following sections will elaborate on the application
of the test methodology and the results for the ten VLSI 
chips and processor board characterization. 

Chip Electrical Characterization 

A chip electrical characterization methodology was developed
to be consistent with the overall strategy of producing
production-quality, shippable VLSI parts to meet product
schedules. This methodology was standardized and
applied uniformly to the ten VLSI components.

The final measure of a chip's manufacturability is its
yield in production. The total yield, Y_t, is defined as:

Y_t = Y_f Y_s

where Y_f is the functional yield, or the percent of all die
that are functional at the nominal operating point, and Y_s
is the survival yield, or the percent of functional die that
survive worst-case operating points.

Y_f is a manufacturing responsibility and is closely linked
to defect density. Y_s is the responsibility of the VLSI design
team and goals for survival yield were set at the outset of
the design effort. The purpose of chip electrical characterization
is to demonstrate that the chip meets its survival
goal under worst-case conditions.



The range of operating variables to be covered by this
methodology included characterization over the system
voltage, frequency, and temperature limits with sufficient
margin to ensure operation beyond these limits. In addition,
the full range of fabrication process deviations was
considered.

Voltage, frequency, and temperature can be varied during
the test. To provide the process variation, characterization
runs were fabricated, with key process parameters varied
within a single run. Automatic parametric test data collected
from parametric devices located on each die provided
the process correlation required to identify worst-case
characterization parts to be used for electrical characterization.
This allowed correlation of chip yields with
process parameters.

The test set for each chip was developed and refined
over time to provide tests for all circuits and paths on 
silicon. The test set had several uses between first silicon 
and final production parts. It provided initial screening of 
new silicon to get good lab prototype parts into systems as 
soon as possible. In general, the time from wafers out of 
fabrication to assembled and tested packages was less than 
four days. The test set was instrumental in full characterization
of each chip between lab prototype and production
prototyping phases. Finally, it has evolved into the tests 
required for the manufacturing process. 

The test set for each chip contained several different
types of tests to ensure complete coverage and full characterization.
Block tests were generated to test individual
circuit blocks on the chip, for example an ALU or comparator
within the register stack. Pin vectors were automatically
extracted from full-chip simulation models to provide
additional test coverage from the chip pads. AC input
timing tests were added to test the speed of input receivers
at the chip pads. Output driver and functional tristate tests
completed the tests used to characterize the pad driver
functions.

Typical chip electrical characterization consisted of exercising
a statistically valid sample of parts from a characterization
run with the chip-specific test set over voltage, frequency,
and temperature ranges. The result was a data base
of failure information for each chip stored off-line to be 
analyzed and manipulated later with a software tool de- 
signed specifically for that purpose. 

Results drawn from the characterization data base were 
used in locating and diagnosing yield-limiting circuits on 
the various chips. Fig. 2 shows a block yield histogram for 
all of the block tests on a given chip. This tool, combined 
with detailed test vector output, allows the isolation of a
circuit problem down to the offending circuit cell. Fig. 3 
shows how circuits that were sensitive to process variations 
were identified. Circuit block failure data is combined with 
the automatic parametric test data to arrive at block yield 
versus any process parameter collected from chip paramet- 
ric devices. 
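
The combination of block failure data with parametric test data amounts to grouping parts by a process parameter and computing the block's pass rate in each group. The Python sketch below shows one simple way to do that. The parameter (expressed as an integer, for example a threshold voltage in millivolts), the bin width, and the sample records are all hypothetical and are not data from the characterization effort.

```python
from collections import defaultdict


def block_yield_by_bin(parts, bin_width):
    """Compute a block's pass fraction versus a process parameter.

    parts:     iterable of (parameter_value, passed) pairs, one per die,
               where parameter_value is an integer measurement from the
               die's parametric devices and passed is True or False.
    bin_width: integer width of each parameter bin.
    Returns a dict mapping the start of each bin to the pass fraction.
    """
    counts = defaultdict(lambda: [0, 0])   # bin start -> [passes, total]
    for value, passed in parts:
        start = (value // bin_width) * bin_width
        counts[start][0] += int(passed)
        counts[start][1] += 1
    return {start: passes / total
            for start, (passes, total) in sorted(counts.items())}


# Hypothetical records: the block starts failing at higher parameter values.
sample = [(610, True), (640, True), (660, False), (690, True), (720, False)]
print(block_yield_by_bin(sample, 50))   # {600: 1.0, 650: 0.5, 700: 0.0}
```

A falling pass fraction across the bins, as in this made-up sample, is the kind of trend that would flag a circuit block as sensitive to that process parameter.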

These electrical characterization results for each chip
were carefully evaluated and circuit fixes included in subsequent
chip releases to meet the production survival goals
initially set. In terms of the initial survival yields from the
lab prototype releases, improvements of 1.15 to 2.0 times
have been observed with production prototyping revisions
of the VLSI chips.

Fig. 2. Typical block yield histogram for all of the blocks on a
single chip (axis label: Chip Block Tests).

Processor Board Characterization

The basic goal of the processor board electrical characterization
effort was to determine the operating range of
the processor boards over frequency, voltage, temperature,
and VLSI process spreads. In addition, it was desired to
provide feedback on any chip deficiencies to the VLSI design
groups as quickly as possible. This minimized the
possibility of incurring additional VLSI turnarounds to
meet performance goals. This effort also supplemented individual
chip characterization and provided insight into
the interactions between chips, clocks, and boards.

Two different types of tests were developed to stress the 
VLSI parts during the characterization effort. One type consisted
of BVF (block vector file) vectors, which used the
chip's scan paths to stimulate the chips and observe the
results. The second type consisted of cache-resident tests.
The cache-resident tests were HP Precision Architecture
code written to test some function, determine if the test
was successful, and store the results in the CPU's general
registers. To execute a cache-resident test, the instructions
were loaded into the cache using the board tester and the
CCUs (cache control units). The code was then executed
by the processor board and the board tester was used to
observe the state of the registers in the processor to determine
if the test had passed. Cache-resident programs allow
the processor board to run by itself without interfacing to 
the memory and I/O subsystems. 

The tests were run under a variety of conditions. Environ- 
mental chambers were used to determine the effect of tem- 
perature on processor board performance. A large amount 
of the testing was done with socketed performance boards
so that VLSI parts could be changed quickly. Characterization
parts used in socketed boards allowed us to study
processor board performance with the full spread of VLSI
parts. Once the worst-case combination of parts was determined,
boards were built with these parts soldered in to
remove the effects of sockets from the results. Voltage-versus-frequency
shmoo plots were generated to determine
the effects of these parameters on the various tests.
Whenever possible, these parameters were varied until the
board failed.

The cache bus is the backbone of the processor board, 
serving as the path for data and control between the VLSI 
parts. Tests were written to test the noise immunity and
speed of the cache bus. 

It was determined early in the project that the worst-case
cache bus noise occurred on a line that was electrically
high while its neighbors were driven low. Therefore, six
BVF tests were written, one test for each of the chips on
the cache bus. The tests are "walking ones" tests, in which
each cache bus line in turn is kept electrically high while
all other cache bus lines are driven low. For each of the
tests, one chip drives the bus and all chips check that their
receivers have the correct value. It is necessary to write
and read the contents of the scan paths of all the chips on
the cache bus for every vector that is sent across the bus
during the test. This effectively limits the time between
vectors to many thousands of states. A cache-resident test
was written that generated some near-worst-case patterns
on the cache bus. The BVF tests allowed us to determine
exactly which line or lines failed a test and thus provided
good diagnostics. The cache-resident test fired patterns on
the cache bus every state and better tested the ability of
the power supplies to handle changes in load current.
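
The walking-ones patterns described above can be sketched in a
few lines of Python. This is only an illustration: the 8-line
bus width and the function names are assumptions made here, not
details of the actual BVF tests.

```python
def walking_ones(width):
    """Yield one vector per bus line: that line high (1), all others low (0)."""
    for high_line in range(width):
        yield [1 if line == high_line else 0 for line in range(width)]


def failing_lines(driven, received):
    """Return indices of bus lines whose received value differs from the driven value."""
    return [i for i, (d, r) in enumerate(zip(driven, received)) if d != r]


# Hypothetical 8-line bus, chosen only to keep the example small.
vectors = list(walking_ones(8))
```

Comparing each driven vector with what every receiver latched
gives the per-line diagnostics that the text credits to the BVF
tests.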

An additional 46 BVF tests were written to test critical
speed paths on the cache bus. For each of the tests, the
path is driven electrically low and the resulting value is
verified at the receiving chip. The speed path tests allowed
us to verify the performance margin of the cache bus. Since
specifications for cache bus speed include clock skew,
propagation delay, and driver and receiver delay, the cache
bus speed tests allowed us to verify the interaction of these
specifications.

In addition to testing the chip-to-chip communication
paths of the cache bus, it was necessary to test paths
involving asynchronous devices, namely static RAM chips. The
RAM arrays play a large role on the processor board,
forming the cache memory and translation buffer memory. Since
the VLSI devices connect to these RAM arrays directly,
electrical characterization of the RAM address drivers,
RAM data drivers and receivers, and various related
internal speed paths was essential.

Two methods were used to test these critical paths for
speed and noise margins. The first method used BVF tests
to exercise a chip's data paths and control structures along
a sensitized asynchronous path. Typically, RAM address
drivers were loaded and directed to drive on the first step
of the test. The data from the previously loaded RAM array
was received and latched in various registers on the clock
cycle immediately following. After the test, the contents
of the receivers and registers were examined to determine
if they were the expected values. The internal paths tested
included comparators, parity trees, and PLAs.

The second method used cache-resident code. Programs
were written to stress the RAM interfaces. These programs
were geared to test either the translation buffer array or the
cache arrays. Typically, alternating address and data patterns
were issued to and received from the RAM arrays. After
execution of the program, registers on the processor board were
examined to determine the results or failure mode of the
program. These tests covered circuitry on all of the
processor board VLSI chips.

The lab prototype processor boards were fully functional
over most of the specified operating range with nominal
parts. Nevertheless, fifteen margin problems were
uncovered during electrical characterization that could occur
with a worst-case combination of VLSI parts, frequency,
power supplies, and temperature. Six of the problems were
speed problems. The speed problems were evenly divided
between the cache bus and the RAM interfaces. Another
six of the problems were caused by power supply
sensitivities. Two of the problems were caused by non-VLSI
circuitry on the processor board. One problem was caused
by the on-chip clock circuit which was shared by all the
chips.

When a problem was discovered, the information was 
forwarded to the chip design group. The processor board
characterization team worked with the chip group to make
sure that the cause of the problem was understood. A bug 
report was then generated which described the problem 
along with any pertinent information so that all groups 
were made aware of the problem. The chip group used this 
information as well as other feedback paths to ensure that 
the next revision of the chip was of manufacturable quality. 
Meanwhile, the electrical characterization team made sure
that other problems were not masked by any already dis- 
covered problems. 

When production prototype boards were available, the 
full set of tests run on the lab prototype boards was re- 
peated. The operating margins of the production prototype 
boards were significantly improved over the lab prototypes.
Fig. 4 shows an example of the lab prototype electrical 
quality and the improvement observed with the production 
prototype version of the processor board. In all cases, the 
production prototype boards work with worst-case combi- 
nations of VLSI parts, frequency, power supplies, and tem-

VLSI Test Methodology



An integrated test methodology was developed for the rapid
turn-on and characterization of the VLSI chips, boards, and
systems described in this issue. The methodology provides solutions
for each of the three tightly coupled components of the VLSI test
problem: design for testability, test systems, and test vector
generation.

Design for Testability

The design for testability methodology is a serial test method
in which only a subset of the memory elements on each chip are
scanned, thereby reducing test circuit overhead. The key aspects
of the on-chip design for testability methodology are:

■ Common diagnostic interface port (DIP) to provide a uniform
interface with the tester independent of the chip's normal
system interfaces.

■ Access to control and data sections using scan paths placed
only at the key interfaces.

■ Single-step testable circuits.

■ I/O pads testable via the DIP.

Fig. 1 shows a simplified block diagram of a typical chip with
DIP, test PLA, and scan paths. The DIP and the test PLA are the
core of each chip's test circuitry. They multiplex up to 16 serial
scan paths and control chip operation. The DIP uses four
dedicated I/O pads to implement a common protocol for shifting the
on-chip scan paths and for controlling test operations. The
important point about the protocol is that one of the pads, the strobe
pad, provides the clock for shifting scan paths into and out of
the data pads at a rate determined by the tester. This means
that the tester data rate does not limit the system clock frequency
and permits low-cost tester implementation. A scannable register
within the DIP holds a 9-bit command word which specifies a
particular test operation to be performed.

The test PLA has two primary functions. First, it controls the
DIP hardware to implement the interface protocol. Second, it
decodes the DIP command and generates the control signals
required to perform the test operation. Each chip must implement
basic commands to shift one of up to 16 scan paths, to halt or
freeze the state of the chip, and to single-step chip operation.
Since the test PLA is automatically generated from a high-level
description, additional test operations are easily added for a
particular chip.
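
The serial shifting of a scan path can be pictured with a small
Python model. It is a sketch under stated assumptions: the class
name, the 4-bit path length, and the bit ordering are
hypothetical, and the real DIP command encoding is not modeled.

```python
class ScanPath:
    """Toy model of one serial scan path (a shift register)."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift(self, bit_in):
        """One strobe: shift one bit in at the head and return the bit leaving the tail."""
        bit_out = self.bits[-1]
        self.bits = [bit_in] + self.bits[:-1]
        return bit_out

    def scan(self, data_in):
        """Shift a whole sequence in, returning the bits shifted out."""
        return [self.shift(b) for b in data_in]


# Hypothetical 4-bit path: load a pattern, then read it back by shifting zeros in.
path = ScanPath(4)
first_out = path.scan([1, 0, 1, 1])   # previous (all-zero) contents come out
second_out = path.scan([0, 0, 0, 0])  # the loaded pattern comes out, first bit in first
```

Because each shift is clocked by the tester, the model has no
notion of the system clock, which mirrors the point made above
that the tester data rate does not limit the system clock
frequency.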

Most of our chip designs are partitioned into separate data
and control sections. The data section or data path consists of
custom functional blocks which communicate via local and global
buses. The control section is implemented with large
synchronous PLAs. Complete testability of the PLA control section is
achieved by fully scanning all inputs and outputs. This allows us
to halt or freeze the PLA sequencing and to test the PLA array.
Testability of the data path is achieved with one or more
scannable registers which provide read and write access to each
bus in the data path. Any functional block that is not directly
scannable is testable because the global buses are controllable
from scannable registers and the block's control lines are
controllable from the PLA scan paths. Control lines are fired to transfer
data from the scannable register to the block under test, perform
the desired function, and return the results to another scannable
register.

Single-step testing requires that each circuit in the chip be
haltable and steppable so that scan operations can be performed
without altering the chip state. In NMOS designs with dynamic
circuits, it is not possible to stop the clock to halt circuit operation.
In this case, each circuit must have an idle state or combination
of control inputs that causes the values of any memory element
to remain constant. In addition, each circuit must be able to enter
and exit that idle state cleanly to ensure that single-step operation
is the same as free-running operation. The result is the ability to
halt a free-running chip or system. Once the chip or system is
in the idle state, the state sequence can be altered to perform
a wide variety of test operations such as testing individual blocks
or simply observing the current state and resuming operation.

At-speed functional testing of the pad driver and receiver
circuits is controlled by a separate I/O scan path. During test
operation, the data from the driver scan latch drives the pad circuits
while the receiver scan latch captures the result. In a board or
system, the I/O scan path circuits enable both electrical and
functional analysis of system bus transactions and the emulation
of signal responses from uninstalled chips or subsystems.

The on-chip test circuits require <10% of chip area and <8%
of the chip's power. To ensure that the DIP does not limit yield 
or performance, it is designed using conservative design tech- 
niques and for 45-MHz operation. 

Test Systems 

An integrated family of testers was developed to meet the test
system requirements for wafer, package, board, and system
testing (Fig. 2). A common tester operating system developed on
an HP 9000 Pascal workstation provides a uniform user and test
vector interface with special emphasis on interactive test and
diagnostics. Any command can be executed from the keyboard
or compiled into a test program. Test vectors can be leveraged
at each testing phase.

The test system hardware consists of an HP 9000 Series 200
controller, a set of HP-IB (IEEE 488/IEC 625) instruments, and
custom DIP and test head circuitry. The simplest version is the
system tester. It consists of the system controller and functional
test hardware for the serial test of up to 192 chip systems. The
board tester version provides power and clocks for the board
under test, and the package tester adds a 288-pin test head
with parametric test capability and timing generators for input
testing. Finally, the wafer test system adds a cassette-to-cassette
wafer prober with optional microprobe capability. The network
interface is used to transfer vector files and store functional test
data for off-line statistical analysis.

Fig. 2. Test system family.

At the board and system level, the test interface is implemented
by connecting the DIP signals for each VLSI chip to the tester
DIP interface hardware. The job of the DIP interface is to
synchronize the DIP operations to make it possible to halt or single-step
an entire board or system. This also gives us access to and
control of all the buses in the system.

Test Vector Generation 

The test vector generation process uses a divide-and-conquer
approach to manage the complexity of the problem. The chip is
partitioned into independently testable functional units or blocks.
A register, an ALU plus the operand registers, or a PLA are
examples of blocks. Block tests are the independent tests
generated for these blocks in terms of the block ports and are written
in a high-level test language. Block tests are generated in three
ways: manually generated by the block designer, leveraged from
the simulation vectors used in the design phase, or in the case
of PLAs, automatically generated using a stuck-at fault model to
ensure fault coverage. A set of tools was developed to compile
the block test into serial DIP commands in the form required by
the tester. These tools also provide translation to and from
simulators for the verification and generation of block test vectors.

Don Weiss 

Project Manager 

Colorado Integrated Circuits Division

(continued from page 23)

perature over the fully specified range of the products with 
significant margin. The same tests used in the board
characterization effort are also used in board manufacturing and
failure diagnosis. 

Conclusion 

The design methodology to achieve the first release of
silicon resulted in a design cycle of 10 months or less. Our
cycle time response for mask generation, IC fabrication,
assembly, and test was less than five weeks. A processor
board was integrated and running cache-resident code
within two days of delivery of the last component.

The test methodology allowed partial integration and
turn-on of the processor board as well as at-speed electrical
characterization, independent of the rest of the system
components. The level of functionality obtained with the lab
prototypes resulted in completion of the HP-UX boot for
the HP 9000 Model 825 processor in less than one month
from the delivery of the last component. Functional bugs
encountered in the evaluation phase were minor. In a few
cases the operating systems were required to patch these
bugs, but the patches were trivial in nature.

In general, the electrical quality of the lab prototype
hardware resulted in systems that operated with margin
around the nominal operating point. In characterization,
under worst-case conditions of voltage, temperature, and
normal process variations, some margin problems were
identified. These results were consistent with the original
strategy that was set for the lab prototype version of the
VLSI.

The above methodology proved powerful, in that five
chips of the ten were released to manufacturing with only
two revisions. The remaining five chips, which required a
third release, provided functional and electrical quality
that allowed the system integration to proceed electrically,
mechanically, and functionally according to schedule
requirements. The CPU chip was released to manufacturing
as revision 3.0. The CPU's first revision served as a vehicle
to demonstrate the design and test tools. This release was
before the time the cache system definition was complete,
and as such, required a second release to achieve lab
prototype quality. Two of the remaining four chips required
a third revision for improved manufacturing margins, while
the remaining two required a third revision for functional
bugs.

A Midrange VLSI Hewlett-Packard
Precision Architecture Computer

It's designed for mechanical and electrical computer-aided
design, computer integrated manufacturing, real-time
control, and general-purpose technical applications.

by Craig S. Robinson, Leith Johnson, Robert J. Horning, Russell W. Mason, Mark A. Ludwig,
Howell R. Felsenthal, Thomas O. Meyer, and Thomas V. Spencer



THE GOAL ESTABLISHED for HP Precision Architecture
computers was to provide a scalable set of
hardware and software with the flexibility to be
configured for many different applications in a wide variety
of market areas. The HP 9000 Model 825 (Fig. 1) is a
midrange, compact, high-performance NMOS-III VLSI
implementation of HP Precision Architecture. The wide range
of system components available in this architecture are all
compatible with the Model 825. These include operating
systems, languages, graphics, networking, and a wide
variety of peripherals. Also, because adapting to established
environments and easy porting of existing applications are
of vital import, the Model 825 has been designed in
accordance with international standards wherever possible.

Fig. 1. The HP 9000 Model 825 is a midrange, compact NMOS
VLSI implementation of HP Precision Architecture. It is designed
for both single-user workstation and multiuser applications
running the HP-UX operating system. The SPU is the middle unit
at left.

User Requirements 

The definition of the Model 825 was driven by
requirements from several application areas. As a high-performance
graphics workstation for mechanical engineering
and electrical engineering computer-aided design, small
size coupled with high floating-point computational
performance for computationally intensive technical
applications was required. The size of the configured system
needed to be relatively small, since in CAD applications,
the SPU is often in the immediate work area of the user.
For the same reason, minimizing the level of audible noise
was important. As a general-purpose technical computer
running the HP-UX operating system, the product required
a flexible range of I/O configurations.

Additional requirements were presented by computer
integrated manufacturing and real-time control
applications, where battery backup of main memory is a
requirement. The battery backup option provides at least 30
minutes of backup power during power outages. Also required
was the ability to operate over the ranges of temperature,
humidity, electrical interference, and mechanical vibration
typically encountered on the factory floor.

Overall Design 

Fig. 2 shows the major assemblies of the Model 825
Computer, and Fig. 3 shows how the product is organized to
meet the user requirements. The enclosure is 325 mm wide
by 230 mm high by 500 mm deep, compatible with other
HP stackable peripheral components. Within this enclosure
is a series of card cage slots capable of accommodating a
wide range of user configurations.



Nine card slots are available. The first two hold the
processor and system boards. The remaining seven slots can
be used for system memory, I/O interface cards, interfaces
to high-performance bit-mapped graphics displays
including the HP 9000 SRX Solids Rendering Accelerator, and
adapters for I/O expansion.

A memory card (8M bytes) and an I/O interface card are
half-width cards, and together fill one card cage slot.
Graphics interfaces and I/O expansion adapters are
full-width cards. The following are possible configurations:

The Model 825 is rated at 3.1 MIPS supporting multiple
users in technical HP-UX applications and at 5.2 MIPS in
single-user applications.

Model 825 Processor 

The Model 825 processor consists of two boards. The
main processor board contains the core CPU function,
including a 16K-byte cache, a 2K-entry translation lookaside
buffer (TLB), clock circuitry, and several bus interface
circuits. The second board contains most of the floating-point
math subsystem, the I/O channel, and the processor
dependent hardware. These two boards plug into adjacent
motherboard slots and communicate via the math bus and
the MidBus.

Main Processor 

The Model 825 processor is highly integrated, consisting
of six high-speed VLSI circuits communicating via the
cache bus. These six chips are the CPU, which contains
the core CPU function and implements the HP Precision
Architecture instruction set, the translation control unit
(TCU), which performs virtual-to-real translations and
access protection, two cache control units (CCU), each of
which controls a static RAM array that makes up a "group"
of the cache memory, a math interface unit (MIU), which
implements the floating-point math coprocessor function
and controls the floating-point math chips, and the system
interface unit for the Model 825 (SIUF), which interfaces
the cache bus to the main memory bus, the MidBus. Details
of these chips are discussed in the paper on page 4.

Fig. 2. HP 9000 Model 825 major assemblies.



These VLSI chips are built using HP's high-performance
NMOS-III process. They are packaged in 272-pin pin-grid
arrays, and consume from 7 to 12 watts depending on the
chip type. The basic system frequency is 25 MHz. Providing
an environment in which these chips can operate is a
significant design challenge.

Cache Bus 

Details of the operation of the cache bus are covered in
the paper on page 4. In the Model 825 implementation of
the cache bus, special attention was paid to propagation
delays along printed circuit board traces. For maximum
performance, propagation delays were minimized by using
transmission line techniques and minimum trace lengths.

Fig. 3. HP 9000 Model 825 sys-

In addition, considerable effort went into minimizing
electrical noise for increased system reliability, as
discussed in the next few paragraphs. When worst-case output
levels and input thresholds are considered, the noise
margin on the cache bus is about 0.5V for both low and high
levels. This is similar to noise margins in TTL systems, but
because of the high signal transition rates and the receiver
characteristics, careful design was necessary to achieve a
workable noise budget.

The largest contribution to the noise budget, outside the
pin-grid array, is trace-to-trace crosstalk. Typical printed
circuit design rules allow traces to be routed with 0.016-in
center-to-center distances. Spacing this tight can result in
crosstalk of 40% or more. The Model 825 processor board
construction and design rules were defined so as to limit
trace-to-trace crosstalk to less than 10%.

The next largest possible noise contribution depends on
the method by which the bus is terminated. The NMOS-III
receivers and drivers are designed such that the bus can
operate with no termination at the ends. However, this can
effectively double the magnitude of transitions on the bus
and therefore double the amount of coupled noise.
Terminating the end of the bus effectively reduces the
magnitude of reflections, resulting in lower coupled noise. This
also helps to absorb the noise that is coupled onto a victim
line.
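
The doubling effect of an unterminated line end follows from the
standard transmission line reflection coefficient. The sketch
below is illustrative only: the 60-ohm line impedance and the
1.5V incident step are assumed values, not measurements from the
Model 825.

```python
import math


def reflection_coefficient(z_load, z0):
    """Voltage reflection coefficient at the end of a line of impedance z0."""
    if math.isinf(z_load):
        return 1.0  # an open (unterminated) end reflects the full wave
    return (z_load - z0) / (z_load + z0)


def end_voltage(v_incident, z_load, z0):
    """Voltage at the line end: the incident wave plus its reflection."""
    return v_incident * (1.0 + reflection_coefficient(z_load, z0))


Z0 = 60.0  # ohms, assumed for illustration
open_end = end_voltage(1.5, math.inf, Z0)  # unterminated: step doubles
matched_end = end_voltage(1.5, Z0, Z0)     # terminated in Z0: no reflection
```

With a termination equal to the line impedance the reflection
coefficient is zero, which is why a resistor close to the
characteristic impedance keeps the transition at its incident
magnitude.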

Each cache bus line is terminated by a resistor close to
the characteristic impedance of the board traces. Most lines
are bidirectional and are terminated at both ends. Some
lines are unidirectional and are only terminated at the
receiving end to save power and reduce part count.

Special resistor packs are used to terminate the cache 
bus. These packs are designed for low inductance and low 
common lead resistance to reduce crosstalk internal to the 
resistor pack. 

One disadvantage of resistor terminators is increased
power dissipation. For the Model 825 design there is
another problem. Power consumed by the bus depends on
bus activity. Under some processing conditions, bus power
can change from essentially no load to full load, or vice
versa, in one machine cycle. Power supplies are typically
not capable of responding to transients of this speed. Power
supply droop affects precharge level and consequently
reduces noise margin. This was solved by mounting a
low-series-resistance, high-valued aluminum electrolytic
capacitor directly on the main processor board.

Math Subsystem 

The Model 825 has a floating-point math coprocessor.
Its interface to the main processor is the math interface
unit (MIU) chip, which connects to the cache bus and
controls the three floating-point math chips. Ideally, the
floating-point math chips should be on the same board as the
MIU. However, board space constraints would not allow
this. Instead, the floating-point math chips are located on
the adjacent system board, and the math bus is run to them
through the motherboard. The extra time necessary to
transfer data across the motherboard is minimal and does not
cause a performance loss at the Model 825's frequency of
operation.



An additional constraint on the design of this part of the
system was that power supply considerations and power
dissipation on the boards made it impossible to terminate
this bus. There was also no room on either board for the
terminator components. A workable system was devised
by building a detailed Spice model of the entire
interconnect system. NMOS-III driver sizes were selected such that
speed is sufficient, but transition time is maximal. Special
treatment was given to critical clock lines that run from
the MIU to the floating-point chips.

Cache Array 

The Model 825 has a 16K-byte cache organized as two
8K-byte groups. Each group is implemented with a cache
control unit (CCU) and eight 2K x 8 25-ns static
random-access memories (SRAMs). Five of the SRAMs are used for
data and data parity and three are used for tag, status, and
tag parity.

The cache access time is in the critical path that
determines the clock period for the CPU, which is directly
proportional to the performance. The Model 825 clock period
is 40 ns. The address is driven on a clock edge, and the
CCU must determine if there is a cache hit by the next
clock edge. It takes the CCU 7.5 ns to compare the tag to
the real address. With 25-ns SRAMs, this leaves 7.5 ns for
the address to be driven to the SRAMs and data to be driven
back to the CCU. The timing has been verified by the use
of Spice simulations and experimentation.
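
The timing budget in the preceding paragraph is simple
subtraction, shown here as a short Python check. The figures
come from the text; the function names are made up for the
example.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def ram_path_budget_ns(period_ns, tag_compare_ns, sram_access_ns):
    """Time left for driving the address out and the data back."""
    return period_ns - tag_compare_ns - sram_access_ns


period = clock_period_ns(25.0)                  # 25-MHz system frequency
budget = ram_path_budget_ns(period, 7.5, 25.0)  # tag compare and SRAM access times
```

Here `period` is the 40-ns clock period and `budget` is the
7.5 ns left for the address and data drive times.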

Each RAM address line is a single line with a Schottky
diode at the end to clamp undershoot. There is also a 150Ω
resistor to 2.85V. The undershoot is clamped mainly to
prevent the voltage on the line from ringing back up above
0.7V. The resistor is not for transmission line termination.
Its main purpose is to limit the high-level output voltage
of the driver. As a result, the high-to-low voltage transition
is smaller, giving less ringing in the fast case and making the
slow case faster. The slow case model is dominated by the
capacitive effects and the limited current that can be
provided by the driver, and so a smaller voltage transition will
be faster. This can be seen in the basic capacitor equation:

I × dT = C × dV.

Simulations were done to determine the optimal value
of resistor to use. A smaller resistor always helps improve
the low-to-high transition time because it increases the
current. For the high-to-low transition a smaller value helps
decrease the transition time by making dV smaller, but also
causes an offsetting increase in the transition time because
it decreases the current available to change the voltage
across the capacitor. The termination voltage could also
have been optimized, but this was not necessary because
the timing budget was met using the already available 2.85V
supply.
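
The capacitor equation can be rearranged to dT = C × dV / I to
show why a smaller voltage swing is faster in the slow,
current-limited case. The load capacitance, drive current, and
unclamped swing below are hypothetical values chosen only to
make the arithmetic visible.

```python
def transition_time_ns(c_pf, dv_volts, i_ma):
    """dT = C * dV / I. With C in pF, dV in volts, and I in mA the result is in ns."""
    return c_pf * dv_volts / i_ma


# Hypothetical values: 50-pF load, 20-mA available drive current.
full_swing = transition_time_ns(50.0, 4.0, 20.0)      # assumed unclamped high level of 4V
limited_swing = transition_time_ns(50.0, 2.85, 20.0)  # high level limited to 2.85V
```

In this simplified picture the drive current is held constant.
As the text notes, a smaller resistor also takes current away
from the driver on the high-to-low transition, so the real
optimum had to be found by simulation.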

Clock Circuit 

To meet the system skew budget for the Model 825, each
chip must receive a master clock (clock SYNC) that
transitions from 0.9V to 4.1V in less than 3 ns. There must be
less than 600 ps skew from any chip to any other chip.
There are additional specifications for low and high levels,
duty cycle, etc.

To put 600 ps in perspective, electromagnetic waves
travel through glass epoxy printed circuit board material
at roughly 200 ps/inch. 600 ps is the delay of about 3 inches
of printed circuit board trace. Since there are six VLSI chips
that require this signal, and each chip is about 2.2 inches
on a side, it is not possible simply to connect the clocks
in a daisy-chain manner.
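
The skew argument can be checked numerically. The propagation
delay, skew limit, chip count, and chip size are taken from the
text. Treating the first-to-last distance of a daisy chain as
five chip widths is a rough lower bound assumed here for
illustration.

```python
PS_PER_INCH = 200.0  # approximate delay of glass epoxy board trace


def max_length_difference_in(skew_ps, ps_per_inch=PS_PER_INCH):
    """Largest trace length difference that stays within the skew budget."""
    return skew_ps / ps_per_inch


def daisy_chain_span_in(n_chips, chip_side_in):
    """Rough lower bound on first-to-last trace length with chips placed side by side."""
    return (n_chips - 1) * chip_side_in


allowed = max_length_difference_in(600.0)  # inches of trace within the skew budget
span = daisy_chain_span_in(6, 2.2)         # inches from first chip to last chip
```

Since `span` is several times larger than `allowed`, a daisy
chain would exceed the 600-ps budget on path length alone.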

The solution is to supply the master clock to each VLSI 
chip with a separate, equal-length trace. All of these clock 
supply lines emanate from a single drive point. 

It is also desirable that the rise time of the master clock
be the same at each VLSI chip. This is a problem because
the nominal master clock input capacitance is somewhat
different for each chip type. The rise time at the chip
receiver is roughly the no-load rise time plus Z0C, where Z0
is the characteristic impedance of the master clock line
and C is the input capacitance. This problem is alleviated
by adjusting the master clock line impedance for each chip
such that Z0C is constant for all chip types. Additionally,
so that these impedances track as closely as possible, all
clock traces are run on the same trace layer.
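
Holding Z0C constant amounts to choosing each line impedance
inversely proportional to the chip's input capacitance. The
capacitances, chip labels, and target product below are invented
for illustration. The source does not give the actual values.

```python
def matched_impedance_ohms(target_z0c_ps, c_pf):
    """Z0 such that Z0 * C equals the target (ohms times pF gives ps)."""
    return target_z0c_ps / c_pf


# Hypothetical input capacitances in pF for three chip types.
input_capacitance_pf = {"chip_a": 10.0, "chip_b": 12.5, "chip_c": 8.0}
TARGET_Z0C_PS = 500.0  # hypothetical common Z0*C product

impedances = {
    name: matched_impedance_ohms(TARGET_Z0C_PS, c)
    for name, c in input_capacitance_pf.items()
}
```

A chip with more input capacitance gets a lower-impedance clock
line, so the added rise time is the same for every chip type.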

Since it is important for the chips to receive a clean
master clock signal, termination is necessary to reduce
reflections. Source termination was chosen for its low power
and reasonable drive current levels.

Clock Buffer Circuit. The single drive point impedance is
about 7Ω. Combined with the level and rise time
requirements of the VLSI chips, this dictated the need for a special
clock buffer circuit. The circuit can be split into two pieces:
the front end, which generates a signal with the appropriate
rise time and high and low levels, and an output section
capable of driving 7Ω.

This circuit is implemented in discrete high-frequency
transistors. Great care is taken to bias the collector-base 
junctions to minimize the Miller effect. 

The front-end stage takes the TTL-level input signal,
sharpens the edge, and produces the correct level for the
output stage. The output stage consists of several emitter 
followers that transform the front end's high-impedance 
signal to the low impedance necessary to drive the distri- 
bution lines. 

Printed Circuit Board Construction 

Since cache bus design requires transmission line
techniques, the printed circuit board itself must be constructed
in a controlled-impedance manner. The characteristic
impedance of a printed circuit trace is determined primarily
by the width of the trace, the dielectric constant of the
insulating material, and the geometry of the trace in relation
to its adjacent reference plane(s). For reasons related to the
board fabrication process, all Model 825 traces are of the
stripline variety, that is, the traces are on one board layer
(signal layer), and this layer is sandwiched between two
board layers with conductive planes on them (plane layers).

High-pin-count, high-speed VLSI created a significant
problem for the printed circuit board construction. The
two planes that form the stripline configuration should be
perfectly ac coupled. If they are not, the signal trace and
the planes form what can be viewed as a capacitive divider.
When a signal propagates down a trace, some voltage is
induced in one plane with respect to the other. Typically,
one of the two planes doubles as a power supply
distribution layer. The result is noise in the power supply that is
also coupled down onto other victim traces.

Normally this problem can be neglected because the
plane-to-plane capacitance is much greater than the trace-
to-plane capacitance, and transitioning lines are spatially
and temporally far enough apart so that the net effect is
small.

On the other hand, the high-pin-count, high-speed VLSI 
used in the Model 825, in combination with relatively high
logic swing levels (as much as 3V), is capable of causing
as many as 62 closely spaced lines to transition nearly 
simultaneously. This can result in significant noise 
coupled into the planes and signal lines. 
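
The capacitive-divider view described above can be sketched numerically. This is only a lumped approximation: the function name and the capacitance values are hypothetical and chosen to show the trend, not taken from the Model 825 board.

```python
def induced_plane_noise(swing_v, c_trace_plane_pf, c_plane_plane_pf, n_lines=1):
    """Voltage induced between the two planes by n_lines switching together.

    Lumped capacitive-divider approximation: the switching traces couple
    into one plane through n_lines * c_trace_plane_pf, and the
    plane-to-plane capacitance forms the other leg of the divider.
    """
    c_drive = n_lines * c_trace_plane_pf
    return swing_v * c_drive / (c_drive + c_plane_plane_pf)


# Hypothetical capacitances (pF); only the 3V swing and the 62 lines
# come from the text.
one_line = induced_plane_noise(3.0, 2.0, 5000.0, n_lines=1)
many_lines = induced_plane_noise(3.0, 2.0, 5000.0, n_lines=62)
print(f"1 line: {one_line * 1000:.2f} mV, 62 lines: {many_lines * 1000:.1f} mV")
```

With one line switching, the plane-to-plane capacitance dominates and the induced voltage is negligible. With 62 lines switching together, the coupled capacitance is 62 times larger and the noise grows almost in proportion.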

The obvious solution is to use ground planes between 
all signal layers. The ground planes would be much closer 
to the ideal situation, since they are tied together by a large 
number of vias. Unfortunately, this is not feasible because
of board thickness and cost considerations. Careful analysis
yielded a board construction with sufficient noise decou- 
pling and reasonable overall thickness: 
All signal layers except for layer 9 have 0.008-in traces,
impedance-controlled to 50 to 70 ohms. All cache bus
traces are on layers 3, 5, and 7. Layer 9 has slightly lower
impedance and is used for TTL and other miscellaneous
signals.

Printed Circuit Board Layout. One of the most significant
challenges of the Model 825 processor was the trace layout
of the main processor board. Almost every trace on the
board had a length restriction. Cache bus topology had to
be rigorously controlled. Clock buffer performance was lay-
out dependent. Thermal and interconnect considerations
restricted parts placement. The board contains three major
buses and three distinct static RAM arrays. It was extremely
important to limit the number of layers and the board thick-
ness for reasons of cost and manufacturability.

Autorouting was out of the question. It also became clear 
that previously used systems were inadequate. Since we 
knew that board complexity would require hand layout, 
what we needed was a powerful graphics editing system.
HP's Engineering Graphics System (EGS) is such a system,
and it was readily available. The flexibility of EGS, com-
bined with specially written macros and programs, allowed
us to build a system tailored to the needs of this layout. 
I/O Channel

The I/O system used in the HP 9000 Model 825 is HP's
CIO. The system board contains a single VLSI chip that
converts from the MidBus to the CIO bus. Details of it are
described in the paper on page 38.

Processor Dependent Hardware 

HP Precision Architecture allows implementation de-
pendent computer functions, such as the Model 825's con-
trol panel interface, stable store, and time-of-day clock, to
be realized in a way that is convenient for the particular
design. Details of operating these functions are hidden from
higher-level software by a layer of firmware called proces-
sor dependent code. This code is stored in ROM and begins
execution immediately upon power-up.

A primary goal of the processor dependent hardware 
design was low cost. This hardware communicates with 
the processor using a special simplified protocol and only 
the low-order byte on the MidBus. The architecture re-
serves a certain area of the real address space for processor
dependent use. The SIUF chip decodes references to these
locations, assembling the bytes into words if necessary for
the CPU. 

The processor dependent code ROM is simply a large
byte-wide ROM. It contains self-test code, boot code, the
processor dependent firmware, and other information spe-
cific to the Model 825 implementation of HP Precision
Architecture.

The processor dependent hardware includes a battery- 
backed CMOS real-time clock for keeping the correct time 
and date while the computer is turned off. The batteries 
are the same as those used in many watches, cameras, and 
calculators and provide up to a year of backup time.

A stable store is implemented in an EEPROM. Boot path,
system serial number, and other similar information are
stored here. Constants kept in stable store also assist in
more accurate timekeeping and help reduce factory cost.
During system operation, real time is kept by software,
using timing derived from the 25-MHz main system clock.
During power interruptions, real time is kept by the battery-
backed CMOS clock circuit. The CMOS clock circuit has
its own independent, low-power, low-frequency crystal.
The availability of stable store means that crystal correction
factors can be stored for both the main system crystal and
the backup clock crystal. This allows the use of less expen-
sive crystals and provides more accurate timekeeping. In
board manufacture or service, a highly accurate time base
is used to measure the crystal frequencies and correction
factors are written to stable store. To take full advantage
of this scheme, both the 25-MHz main system clock and 
the real-time clock crystals are located on the system board. 
This way the correction factors and the correctable devices 
are installed and replaced as a unit. 
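
The crystal correction scheme above can be illustrated with a short sketch. The function names, the parts-per-million representation, and the example measurement are assumptions made for illustration; the article does not describe how the firmware actually encodes or applies the correction factors.

```python
def correction_ppm(measured_hz, nominal_hz):
    """Correction factor, in parts per million, measured against an accurate time base."""
    return (measured_hz - nominal_hz) / nominal_hz * 1e6


def corrected_seconds(tick_count, nominal_hz, ppm):
    """Elapsed time for tick_count crystal ticks, using the stored correction."""
    actual_hz = nominal_hz * (1 + ppm / 1e6)
    return tick_count / actual_hz


NOMINAL_HZ = 25_000_000        # 25-MHz main system clock (from the text)
MEASURED_HZ = 25_000_500       # hypothetical factory measurement

ppm = correction_ppm(MEASURED_HZ, NOMINAL_HZ)      # value written to stable store
ticks_per_day = MEASURED_HZ * 86_400               # ticks counted in one true day
uncorrected = ticks_per_day / NOMINAL_HZ           # what software would compute
corrected = corrected_seconds(ticks_per_day, NOMINAL_HZ, ppm)
print(f"{ppm:.1f} ppm, daily drift uncorrected: {uncorrected - 86_400:.3f} s")
```

In this example a crystal that runs 20 ppm fast would gain more than a second and a half per day if uncorrected, which is why storing the measured factor lets a less expensive crystal keep accurate time.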

Also in the processor dependent hardware is a latch the 
processor can read to determine the status of the secondary 
power and other resources. There is also a register the 
processor can use to control the state of the front-panel
LEDs.

Memory Subsystem 



The Model 825 supports up to seven memory array
boards. Each board contains an 8M-byte memory array
which is interfaced to the MidBus by a memory controller.
The block diagram of the memory board is shown in Fig. 4.

Fig. 4. Block diagram of the Model 825 memory board.

The memory controller is a custom integrated circuit
implemented in HP's NMOS-III technology. It is designed
to provide the following features:

■ Provide all signals to control 120-ns 1M-bit dynamic
random-access memory chips (DRAMs)

■ 19-Mbyte/s transfer rate on the 8.3-MHz MidBus

■ Error logging

■ Correct single-bit errors and detect double-bit errors

■ Provide a mechanism to correct double-bit errors with
a known hard error

■ Support of memory board test and diagnostic features

■ Refresh

■ Battery backup

■ Compact size which, combined with surface mount tech-
nology, allows a board size of less than 50 square inches.

Memory Bus Interface

The Model 825 memory board supports 16-byte and 32-
byte block read and write operations and a semaphore op-
eration. High-bandwidth data transfers are provided by the
32-byte transactions. The memory bus interface consists of
a 72-bit-wide data bus, a 10-bit-wide address bus, two row
address strobes (RAS), four column address strobes (CAS),
and one write enable signal (WE). Multiple RAS and CAS
lines are used to reduce the delays and meet the timing
requirements while driving these heavily loaded signals.
The memory array is organized as one row of DRAMs with
1M words of 72 bits each. Each memory word is packed
with two 32-bit words from the MidBus and eight Hamming
check bits.

Fig. 5. Timing for memory read and write operations.

Fig. 5 shows the timing for memory read and write oper-
ations. The 20 address bits required to address the 1M-bit
DRAMs are multiplexed onto the 10-bit memory address
bus and latched into the DRAM address latches by two
negative-going strobes. The first strobe, RAS, latches the
row address. The second strobe, CAS, subsequently latches
the column address. For a write operation, WE is brought
low before CAS, and the data is strobed by CAS. The setup
and hold times for data to be written to the memory array
are referenced to the falling edge of CAS. For a read opera-
tion, WE is held in the high state throughout the memory
transaction, and data read from the memory array is avail-
able within the access time from CAS. The semaphore op-
eration reads 16 bytes from the memory array and clears
the first four bytes without affecting the other 12 bytes.
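
The word packing and address multiplexing described above can be checked with a few lines of arithmetic. This is a sketch only: the names are hypothetical, and the choice of the upper ten address bits as the row address is an assumption, since the article does not say which half is sent first.

```python
DATA_BITS = 2 * 32        # two 32-bit MidBus words per memory word
CHECK_BITS = 8            # Hamming check bits
WORD_BITS = DATA_BITS + CHECK_BITS
WORDS = 1 << 20           # 1M words, one bit of each word per 1M-bit DRAM
ADDRESS_BUS_BITS = 10

data_bytes_per_board = WORDS * DATA_BITS // 8   # usable capacity in bytes
dram_chips_per_board = WORD_BITS                # one DRAM per bit of the word


def split_address(word_index):
    """Split a 20-bit word index into (row, column) for the 10-bit address bus.

    Assumption: the upper 10 bits are presented with RAS and the lower
    10 bits with CAS.
    """
    if not 0 <= word_index < WORDS:
        raise ValueError("word index must fit in 20 bits")
    row = word_index >> ADDRESS_BUS_BITS
    column = word_index & ((1 << ADDRESS_BUS_BITS) - 1)
    return row, column


print(data_bytes_per_board // (1 << 20), "Mbytes,", dram_chips_per_board, "DRAMs")
print(split_address(0xABCDE))
```

The result agrees with the figures in the text: 72 one-megabit DRAMs hold 8M bytes of data plus check bits.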

The 1M-bit DRAMs support a feature known as nibble
mode. Nibble-mode operation allows high-speed serial ac-
cess to more than one word of memory. By packing two
32-bit words from the MidBus into each memory word,
only two serial accesses are required for a 16-byte transac-
tion and only four serial accesses are required for a 32-byte
transaction. The first word is accessed in the normal man-
ner with read data being valid at the CAS access time. The
sequential words are read or written by toggling CAS while 
RAS remains low. The row and column addresses need to 
be supplied for only the first access. Thereafter, the falling 
edge of CAS increments the internal nibble counter of the 
DRAMs and accesses the next word in memory. 
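
The nibble-mode sequence above can be written out as a list of strobe events. This is an illustrative sketch; the function name and the wording of the steps are not from the article, and the final RAS step is an assumption about how the transaction ends.

```python
BYTES_PER_MEMORY_WORD = 8   # two 32-bit MidBus words per 72-bit memory word


def nibble_access_sequence(transaction_bytes):
    """List the strobe events for a 16-byte or 32-byte nibble-mode transaction."""
    if transaction_bytes not in (16, 32):
        raise ValueError("the memory board supports 16- and 32-byte blocks")
    accesses = transaction_bytes // BYTES_PER_MEMORY_WORD
    steps = [
        "RAS falls: row address latched",
        "CAS falls: column address latched, word 0 accessed",
    ]
    for n in range(1, accesses):
        # RAS stays low; only CAS toggles for the following words.
        steps.append(f"CAS toggles: nibble counter increments, word {n} accessed")
    steps.append("RAS rises: transaction ends")  # assumed closing step
    return steps


for step in nibble_access_sequence(32):
    print(step)
```

Only the first access carries a full row and column address, which is where the time saving over four independent accesses comes from.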

Address Strobe Signal Quality 

DRAM RAS and CAS inputs function as clock signals and
must have clean transitions. The assertion of these signals
is also in the critical timing path for accessing data, so
minimizing timing skew is important. To ensure that the
transitions will be smooth, the clock signals are routed to 
the DRAMs in a star configuration. 

There are four CAS drivers for every 72 DRAM chips and
so each CAS driver must drive 18 DRAMs. The CAS line is
routed to a location central to the 18 DRAMs. From here
it is split into six signals. Each of these is routed to a point
central to the location of three of the DRAMs. The six lines
are all made the same length to keep the star balanced.
Three DRAMs are connected to each of these six signals.
Again to keep the star balanced, the lines connecting the
DRAMs are all the same length. The four CAS signals coming
from the memory controller are routed electrically identi-
cally. The CAS signals are all driven directly by the VLSI
controller and so each DRAM sees almost exactly the same
signal. This allowed the drivers and series termination to
be optimized to give a smooth low-skew signal at a single
point to ensure that the CAS input signal at all 72 DRAMs
would be optimized. There is very little CAS timing skew
between DRAMs.

The timing for the RAS signal is less critical than for the 
CAS signals, but it is important that the transitions be
smooth and glitch-free. The signal is connected in a star 
configuration but there are only two RAS drivers, so the 
last star connects to six DRAMs instead of three. 

Delay Line 

To conform to MidBus timing, some cards are required
to have a delay line. The value of the delay line depends 
on the delay and timing of the bus interface circuits and 
the MidBus buffers. 

The delay line for the Model 825 memory array board 
must be between 21 and 27 ns. At the time of the design,
reliable surface mount delay lines were unavailable, so
alternatives were investigated. Lumped LC circuits were
tried, but it was hard to guarantee the delay with the wide
tolerances on the parts (mostly the TTL drivers and receiv-
ers). The second alternative was to use a single TTL buffer
and a long trace to get the needed delay. It was found that 
this was feasible. 

The signal propagation delay for a long printed circuit
trace between two ground planes is:

$$1.017\sqrt{\varepsilon_r}\ \text{ns/ft}$$

where $\varepsilon_r$ is the dielectric constant of the insulating medium.
The dielectric constant of the printed circuit board material
is between 4.4 and 4.8, which gives a delay of 2.13 to 2.23
ns per foot. The trace was made 115.5 inches long to give
a delay of 20.5 to 21.5 ns. The TTL driver for the delay
line has a delay of 1 to 5 ns, giving a total delay of 21.5 to
26.5 ns.
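
The delay-line numbers above can be reproduced directly from the formula. The sketch below uses hypothetical function names; the constants are the ones quoted in the text, with the upper dielectric constant taken as 4.8.

```python
import math


def stripline_delay_ns_per_ft(dielectric_constant):
    """Propagation delay of a trace between two planes: 1.017 * sqrt(er) ns/ft."""
    return 1.017 * math.sqrt(dielectric_constant)


def trace_delay_ns(length_in, dielectric_constant):
    """Delay of a trace of the given length in inches."""
    return stripline_delay_ns_per_ft(dielectric_constant) * length_in / 12.0


TRACE_LENGTH_IN = 115.5
ER_MIN, ER_MAX = 4.4, 4.8
DRIVER_MIN_NS, DRIVER_MAX_NS = 1.0, 5.0

trace_min = trace_delay_ns(TRACE_LENGTH_IN, ER_MIN)
trace_max = trace_delay_ns(TRACE_LENGTH_IN, ER_MAX)
total_min = trace_min + DRIVER_MIN_NS
total_max = trace_max + DRIVER_MAX_NS
print(f"trace: {trace_min:.2f} to {trace_max:.2f} ns")
print(f"total: {total_min:.2f} to {total_max:.2f} ns (required: 21 to 27 ns)")
```

The computed range stays inside the 21-to-27-ns window across the full spread of dielectric constant and driver delay, which is the property that made the long-trace approach workable.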

Two of the potential problems with the long trace are 
RFI and the signal coupling into itself. To avoid this the 
trace is sandwiched between two ground planes and runs 
are spaced at least 0.045 inch apart with a 0.015-in ground
trace between runs (sections where the trace loops back on 
itself). 

Powerfail Backup

The power failure backup system in the Model 825 is a
RAM only backup system. If line voltage is lost, the RAM
is powered from a backup battery supply while the rest of
the system is shut down. Since the dynamic RAMs require
little power to retain data, only a relatively small battery
and backup regulator are needed to keep the memory sys-
tem alive. When power is restored after an outage there is
enough information available in the memory to resume 
normal processing after a recovery period. 

To support powerfail backup, the RAM board is designed
to power down its interface to the rest of the computer 
cleanly when the failure occurs and to keep the contents 
of memory refreshed during the outage. Power drain on 
the backup supply has been minimized for maximum 
backup time. 

To retain the data, each of the 512 row addresses of the
DRAM cell matrix must be refreshed within every 8-ms
time period. During normal operation the MidBus clock is
used to provide the timing for the refresh state machine.
However, during a power failure, the MidBus clock is un-
defined and a secondary refresh clock must be provided
on the memory board. This secondary refresh clock is gen-
erated with a CMOS 555 timer with a maximum period of
7.5 microseconds.
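
The refresh requirement above reduces to simple arithmetic. In the sketch below the 512 rows and the 8-ms window come from the text; the 7.5-microsecond timer period is an assumed reading of a figure that is garbled in the source, and the function names are hypothetical.

```python
ROWS = 512
REFRESH_WINDOW_MS = 8.0


def max_row_interval_us():
    """Longest average interval between row refreshes that still meets the window."""
    return REFRESH_WINDOW_MS * 1000.0 / ROWS


def full_refresh_time_ms(timer_period_us):
    """Time to step through all rows with one row refreshed per timer period."""
    return ROWS * timer_period_us / 1000.0


ASSUMED_TIMER_PERIOD_US = 7.5   # assumption: reading of the garbled source value

print(f"required: one row every {max_row_interval_us():.3f} us or faster")
print(f"full pass at {ASSUMED_TIMER_PERIOD_US} us per row: "
      f"{full_refresh_time_ms(ASSUMED_TIMER_PERIOD_US):.2f} ms")
```

Any timer period below the required interval leaves margin for the tolerance of the timer components during battery operation.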

The powerfail sequence is initiated by the falling edge 
of POW FAIL L, which indicates that the input to the MidBus 
power supplies has failed and a powerfail routine should 
be entered. The power supplies remain regulated long 
enough after the falling edge of POW FAIL L to guarantee 
that the cache will be flushed to main memory before the 
falling edge of POW ON. After POW ON falls and any refresh 
cycle in progress is completed, the memory controller 
switches into the battery backup mode of operation. 

While in the battery backup mode of operation, the mem- 
ory controller holds WE and CAS high to prevent inadvertent 
destruction of the memory contents. In addition, the battery
backup circuits are isolated from spurious inputs from the 
primary control section which occur while power is in 
transition. 

DRAM Error Handling

The memory controller chip incorporates error handling
circuits based on Hamming codes to protect the system
from errors in the DRAMs on each memory board. The
32-bit words on the MidBus are packed into 72-bit words
when written to the DRAMs. The 72 bits consist of two
words from the MidBus and eight check bits generated by
the Hamming circuit.

On a read from memory, the 72-bit word is presented to
the Hamming circuits. If the syndrome word generated is
zero, the word from the DRAMs is uncorrupted and the
data corrector is told to pass the word unaltered. If the
syndrome word generated is nonzero, the condition of the
error (recoverable, unrecoverable, mappable/unmappable)
will be reported in the STATUS register, the cache line ad-
dress will be saved in the ERR ADD register, and the syn-
drome word will be stored in the ERR SYN register. If the
syndrome word equals one of the 72 valid patterns, a single-
bit error has occurred, and the data corrector flips the bit
indicated by the syndrome pattern to recover the data. De-
tection and correction of single-bit errors are transparent
to the system.

If a nonvalid error condition exists, a double-bit (or more) 
error has occurred. The memory controller has circuits for 
recovering from many double-bit errors. To use this feature, 
the system software needs to have identified a troublesome 
bit (usually a hard failure) in a bank of memory. After 
identifying it, the system writes the syndrome word of that 
bit into the MAPSYN register, and by issuing a CMD MAP
signal, notifies the memory controller to suspect that bit
as bad in a double-bit error. Knowing this, when the non-
valid condition occurs, the memory controller will order
its data corrector to flip that bit and recheck the word. If
a valid syndrome word is now calculated, the single-bit
error routine will be invoked. If the syndrome is not valid,
the memory controller will notify the system of an unrecov- 
erable error condition. 
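
The single-bit correction and the mapped double-bit recovery described above can be sketched with a textbook extended Hamming code over 72 bits (64 data bits, 7 positional parity bits, 1 overall parity bit). This is not the controller's actual check-bit matrix, which the article does not give. All names are hypothetical, and the suspect bit is identified here by its position rather than by a MAPSYN syndrome value.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 positions


def _syndrome(word):
    """XOR of the positions (1..71) of all set bits."""
    syn = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            syn ^= pos
    return syn


def encode(data):
    """Build a 72-bit code word from 64 data bits."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    syn = _syndrome(word)
    for p in PARITY_POS:
        if syn & p:
            word |= 1 << p          # drive the positional syndrome to zero
    if bin(word).count("1") & 1:
        word |= 1                   # bit 0: overall parity, makes the weight even
    return word


def extract(word):
    """Recover the 64 data bits from a code word."""
    return sum(((word >> pos) & 1) << i for i, pos in enumerate(DATA_POS))


def _check(word):
    syn = _syndrome(word)
    odd = bin(word).count("1") & 1
    if syn == 0 and not odd:
        return "ok", word
    if odd:                         # odd weight: treat as a single-bit error
        if syn < 72:
            return "corrected", word ^ (1 << syn)
        return "unrecoverable", word
    return "double", word           # even weight, nonzero syndrome


def read_word(word, suspect_pos=None):
    """Check a word; on a double-bit error, retry with the suspect bit flipped.

    As in any scheme of this kind, the retry helps only when the suspect
    bit really is one of the two bad bits.
    """
    status, fixed = _check(word)
    if status != "double":
        return status, fixed
    if suspect_pos is not None:
        status2, fixed2 = _check(word ^ (1 << suspect_pos))
        if status2 == "corrected":
            return "corrected-mapped", fixed2
    return "unrecoverable", word


cw = encode(0x0123456789ABCDEF)
bad = cw ^ (1 << 5) ^ (1 << 40)     # two flipped bits
print(read_word(bad)[0], read_word(bad, suspect_pos=40)[0])
```

In this sketch there are exactly 72 single-bit error signatures, one per bit of the stored word, which parallels the 72 valid syndrome patterns mentioned in the text.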

Internal Clocks 

The internal phase clocks of the Model 825 memory con-
troller are generated by the circuit described in "A Preci-
sion Clocking System" on page 17. The SYNC input that
circuit requires is generated by an on-chip circuit that effec-
tively doubles the 8.33-MHz MidBus clock frequency to
16.67 MHz. This doubles the number of well-controlled
clock phases per bus state for better control of DRAM tim-
ing. Fig. 6 is a diagram of the 2x clock generator.

The basic building block of the 2x clock generator is 
the delay element shown in Fig. 7. The delay element makes 
use of the voltage-controlled resistance characteristic of 
MOSFETs. A capacitor in the delay element is precharged 
when the delay element's input is low. This causes the 
ARM signal to go high. This ARM signal is NANDed with the 
input to generate the output. When the input goes high,
the output goes low and the capacitor is discharged. When 
the capacitor voltage drops below the threshold of the sense 
FET (pulldown of the first inverter following the RC node 
in Fig. 7), the ARM signal goes low, causing the output to 
go back high. The capacitor's discharge FET is in series 
with a FET controlled by a variable voltage (VCON), so the
length of time the output is low can be varied. 

If the variable voltage is set such that the discharge time
plus the delay to disarm the output is one quarter of the
MidBus clock period, the inverted outputs of alternate ele-
ments within a series of delay elements can be ORed to-
gether to provide a SYNC signal at twice the MidBus clock
frequency.

Fig. 6. 2x clock generator generates a SYNC signal at twice
the MidBus clock frequency.

An advance/retard circuit looks at the inverted (positive
pulse) output of five delay elements connected in series.
When the output pulse of the second element (B) starts
within the output pulse of the fifth element (E), the pulses
are longer than desired. The advance/retard circuit in-
creases the variable voltage VCON, which decreases the pulse
width. When NOT (A+B+C+D) exists, that is, when none of
the first four pulses is present, the pulses are shorter than
desired, and VCON is decreased, which increases the pulse
width.
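
The frequency doubling and the advance/retard decision can be modeled with idealized back-to-back pulses. This sketch makes several assumptions: each delay element fires as soon as the previous pulse ends, elements A and C are the "alternate" outputs that are ORed, and the MidBus period is rounded to 120 ns. The function names are hypothetical.

```python
def pulse_windows(width_ns, n_elements=5):
    """(start, end) of the pulses from elements A..E, starting at the clock edge."""
    return [(i * width_ns, (i + 1) * width_ns) for i in range(n_elements)]


def sync_rising_edges(period_ns):
    """Rising edges of SYNC within one period when the width is exactly T/4."""
    windows = pulse_windows(period_ns / 4)
    return [windows[0][0], windows[2][0]]      # OR of elements A and C


def width_status(period_ns, width_ns):
    """Classify the pulse width the way the advance/retard circuit does.

    Only meaningful for widths reasonably close to T/4.
    """
    windows = pulse_windows(width_ns)
    e_start, e_end = windows[4]
    next_b_start = period_ns + windows[1][0]   # pulse B of the following cycle
    if 4 * width_ns < period_ns:
        return "short"   # a gap exists where none of A..D is high
    if e_start <= next_b_start < e_end:
        return "long"    # next B starts inside E
    return "ok"


T_NS = 120.0   # approximate period of the 8.33-MHz MidBus clock
print(sync_rising_edges(T_NS))
print(width_status(T_NS, 25.0), width_status(T_NS, 30.0), width_status(T_NS, 32.0))
```

Two SYNC edges per MidBus period, half a period apart, correspond to the doubled 16.67-MHz rate quoted earlier.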

A position adjust circuit looks at the rising edge of SYNC
compared to the falling edge of the input clock. When SYNC
is late, the position adjust circuit raises a secondary variable
voltage (VCON2) which acts in parallel with VCON to shorten
two preliminary delay element pulses. These pulses have
a width of T/4 - D/2, where T is the MidBus clock period
and D is the delay in generating SYNC. Thus one SYNC pulse
is placed at T/2 and the next one is aligned with the falling
edge of the input clock.

Power Subsystem 

Power for the HP 9000 Model 825 Computer, up to 435 W 
total, is provided by a switching mode power supply 
operating at 29 kHz and using bipolar technology. The 
basic design of this power system has evolved over the 
past several years and has been employed with increasing 
sophistication in a number of products. In each stage of
this evolution, improvements have been made in selected
areas to increase reliability while reducing complexity and
costs. This incremental approach has allowed the crucial
compromise between using well understood parts and tech-
nology while still exploiting new ideas and developments
in the industry.

Six outputs are provided, including an optional battery-
backed five-volt output, +5VS. This secondary output pro-
vides power to the main memory during a primary power
failure, allowing the product to recover and resume oper- 
ation. The 12-volt batteries and their charger are housed 
in a separate unit and are cabled to the computer. 

Dc fans are used to cool the product and they are con- 
trolled by the power supply. The fans are operated at low 
speed to minimize audible noise in a typical office environ- 
ment but their speed is increased as external temperatures 
rise above 30°C. To maintain critical cooling, the fans are
also operated while the unit is running on batteries.
VLSI-Based High-Performance HP
Precision Architecture Computers 

The same system processing unit powers two computer
systems, one running the MPE XL operating system for 
commercial data processing and one running the HP-UX 
operating system for technical and real-time applications. 

by Gerald R. Gassman, Michael W. Schrempp, Ayee Goundan, Richard Chin, Robert D. Odineal, and
Marlin Jones 



THE HP 9000 MODEL 850S and the HP 3000 Series
950 are currently the largest HP technical and com-
mercial computer products, respectively, to use the
new Hewlett-Packard Precision Architecture, and along
with the HP 9000 Model 825 described in the paper on
page 25, are the first to realize the architecture in propri-
etary VLSI technology. The first technical and commercial
HP Precision Architecture systems were the HP 9000 Model
840 and the HP 3000 Series 930, which use commercial
TTL technology.

The HP 9000 Model 850S and the HP 3000 Series 950
are both based on the same system processing unit (SPU),
which consists of processor, memory, I/O, power, and pack-
aging subsystems. The Model 850S/Series 950 processor
uses the NMOS-III VLSI chip set described in the papers
on pages 4 and 12.



The differences between the Model 850S and the Series
950 are primarily in the areas of software and configuration.
The Model 850S is configured primarily for technical appli-
cations. It runs HP-UX, HP's version of AT&T's UNIX
System V operating system with real-time extensions. The
Series 950 is configured for business applications. It exe-
cutes MPE XL, a new version of HP's proprietary MPE
operating system. This provides compatibility as well as a
performance upgrade for the current HP 3000 customer 
base. 

The Model 850S/Series 950 SPU has a single processor,
up to 128M bytes of memory supported from a single mem-
ory system, and up to four channel I/O buses. In this paper,
references are made to larger memory and I/O capacity,
and to the support of multiple processors. The hardware
has been designed to support processor, memory, and I/O
extensions beyond those announced to date. These features
are described here solely to illustrate why certain techni-
cal decisions were made. No implication is made regard-
ing any eventual product that may offer these hardware
features.

Fig. 1. System processing unit of the HP 9000 Model 850S and HP
3000 Series 950 Computers.

Delivering 7 MIPS of processing power, the Model 850S/
Series 950 SPU (Fig. 1) is the highest-performing HP Preci-
sion Architecture SPU developed to date. Containing a
single-board VLSI processor capable of running either the
MPE XL or HP-UX operating systems, the SPU is designed
to fit both commercial and technical computing applica-
tions. Performance, user friendliness, system growth, relia-
bility, supportability, and manufacturability were the key
design focal points.

SPU Bus Structure 

The key to the performance and growth capabilities of
the SPU is the bus structure. This determines how and
how fast information is passed from one part of the SPU
to another. The Model 850S/Series 950 SPU uses a hierar-
chical three-tier bus structure, as shown in Fig. 2, to achieve
high performance and allow for future growth. On the first
tier is the fastest bus, a 54-bit-wide bus called the system
main bus (SMB).* The second tier in the bus structure
consists of a pair of 20-Mbyte/s 32-bit-wide buses called
MidBuses.* The third tier in the bus structure is made up
of four 5-Mbyte/s HP-standard channel I/O (CIO) buses.*

*Other bus names used in some product literature are central bus (CTB) for MidBus, system
memory bus for system main bus (SMB), and CIB for channel I/O (CIO) bus.

The Model 850S/Series 950 SPU, as currently released,
has four modules connected to the SMB. These are a state-
of-the-art NMOS-III VLSI processor with hardware floating-
point coprocessor on one board, a memory controller, and
two identical bus converters. The SMB is designed to sup-
port up to eight modules, leaving room for future product
releases. The two bus converters connect the SMB to the
MidBuses, providing six slots on each of them. Channel
adapters connect the MidBuses to the CIO buses. The main
SPU bay supports four CIO buses with five device slots on
each.
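
The lower two tiers of the hierarchy can be summarized as a small data structure. The class and function names below are hypothetical; the counts, rates, and slot numbers are the ones quoted in the text. The aggregate figures are simple sums of peak rates, not measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusTier:
    name: str
    count: int              # number of buses of this kind in the SPU
    mbytes_per_s: float     # peak rate of each bus
    slots_per_bus: int


MIDBUS = BusTier("MidBus", count=2, mbytes_per_s=20.0, slots_per_bus=6)
CIO = BusTier("CIO", count=4, mbytes_per_s=5.0, slots_per_bus=5)


def aggregate_bandwidth(tier):
    """Sum of the peak rates of all buses in a tier (an upper bound)."""
    return tier.count * tier.mbytes_per_s


def total_slots(tier):
    """Total slots provided by all buses in a tier."""
    return tier.count * tier.slots_per_bus


for tier in (MIDBUS, CIO):
    print(tier.name, aggregate_bandwidth(tier), "Mbytes/s peak,",
          total_slots(tier), "slots")
```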

The hierarchical three-tiered bus structure has the power
and the flexibility to support higher-performance and larger
configurations that may be released in the future. The bus
structure also allows the currently released Series 950 and
Model 850S configurations to be different, each tailored to
its own market requirements. This two-for-one design is a
direct result of HP Precision Architecture, which allows a
flexible bus structure to be implemented in an architectur-
ally transparent way, and of a well-thought-through design
that made flexibility of configuration and smooth growth
a very high priority.

Processor Subsystem 

Fig. 2. Model 850S/Series 950 hierarchical system bus structure.
The system main bus (SMB) is on the highest level. Modules on this
bus are the processor boards, two bus converters which connect the
SMB to the two MidBuses, and memory controllers which connect
the memory system to the SMB. Channel I/O (CIO) buses are con-
nected to the MidBuses by channel adapters.

Fig. 3. Model 850S/Series 950 processor board.

The Model 850S/Series 950 processor is a single-board
implementation of the CPU, cache, TLB, and bus interface
functions of the level-one HP Precision Architecture pro-
cessor. A level-one processor supports 16-bit space regis-
ters for a 48-bit virtual address space. The processor oper-
ates at a frequency of 27.5 MHz and provides an average
execution rate of 7.0 MIPS. The Model 850S/Series 950
processor board also contains a floating-point math subsys-
tem capable of providing an average execution rate of 2.8
million Whetstone B1Ds and 0.71 million double-precision
Linpacks per second.
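
The 48-bit figure for a level-one processor follows from combining a 16-bit space identifier with a 32-bit offset within the space. A minimal sketch, with hypothetical names and a simple concatenation assumed as the combining rule:

```python
SPACE_BITS = 16    # level-one space register width (from the text)
OFFSET_BITS = 32   # offset within a space


def virtual_address(space_id, offset):
    """Concatenate a space identifier and an offset into one virtual address."""
    if not 0 <= space_id < 1 << SPACE_BITS:
        raise ValueError("space identifier must fit in 16 bits")
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset must fit in 32 bits")
    return (space_id << OFFSET_BITS) | offset


print(hex(virtual_address(0x0001, 0x0000_1000)))
print(SPACE_BITS + OFFSET_BITS, "address bits in total")
```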

Fig. 3 is a photograph of the Model 850S/Series 950 pro-
cessor board. The board achieves a high level of functional
integration and performance by making use of VLSI tech-
nology, state-of-the-art commercial RAMs, and precision-
tuned clock distribution circuits. The board uses six
NMOS-III VLSI chips developed for the Model 850S/Series
950 project: one CPU, one TCU (TLB control unit), two
CCUs (cache control units), one SIU (system interface unit),
and one MIU (math interface unit). The math functions are
implemented using the floating-point math chips de-
veloped for the HP 9000 Model 550: ADD (add/subtract),
MUL (multiply), and DIV (divide). The Model 850S/Series
950 processor is equipped with a two-set unified data and
instruction cache that has a total capacity of 128K bytes.
The address translation lookaside buffer (TLB) is split into
an instruction TLB with 2K entries and a data TLB with
2K entries. The details of the functional operation of the
processor board are described in the paper on page 4.

Processor Buses 

Fig. 4 shows the organization of the various functions of
the processor board. Three buses are associated with the
processor: the math bus, the cache bus, and the system
main bus (SMB). The math bus is completely self-contained
in the processor board and interfaces the MIU to the math
chips. The cache bus is also contained within the processor
board and interconnects the SIU, CCU1, CCU0, TCU, CPU,
and MIU. The SMB connects the SIU to the memory and
I/O subsystems. The SMB is provided on the SMB connec-
tor, which also supplies the power, the system clocks, and
the interfaces to the support hardware.

The cache bus and the SMB are both precharge-pulldown
buses. That is, there are two phases to the bus operation.
Each signal is precharged to a nominal 2.85V in one phase
(CK2) and conditionally pulled down, or discharged, in the
following phase (CK1) (see Fig. 5). All VLSI chips on the
bus participate in precharging each signal line. Then one
or more chips will drive the signal line low (logical 0 or
1, depending on the signal sense). The precharged buses
provide high performance because the NMOS-III drivers
can rapidly drive a signal line low, thereby minimizing the
data transfer time on the bus. The sender needs to get the
bus data ready only just before the drive phase and the
receiver will have the data immediately after the drive
phase. Consequently, the precharge phase causes no perfor-
mance degradation. To propagate an electrical high value,
a signal is allowed to float at its precharge value. The re-
ceiver uses a zero-catching design, which latches a logical
zero when the signal input falls below the 1.3V minimum
trip level for at least 2.9 ns. This receiver is described in
the paper on page 4.
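
The precharge-pulldown behavior and the zero-catching receiver described above can be modeled in a few lines. This is a behavioral sketch only: the function names are hypothetical, a pulled-down line is idealized as 0V, and the receiver is modeled on discrete voltage samples rather than a continuous waveform.

```python
PRECHARGE_V = 2.85    # nominal precharge level (from the text)
TRIP_V = 1.3          # minimum trip level (from the text)
MIN_LOW_NS = 2.9      # minimum time below the trip level (from the text)


def line_after_drive_phase(pulldowns):
    """Level of one signal line after the drive phase.

    pulldowns holds one boolean per chip on the bus; the line stays at its
    precharge value unless at least one chip pulls it down (wired logic).
    """
    return 0.0 if any(pulldowns) else PRECHARGE_V


def zero_caught(samples, trip_v=TRIP_V, min_low_ns=MIN_LOW_NS):
    """True if the line stays below trip_v for at least min_low_ns.

    samples is a time-ordered list of (time_ns, volts) pairs.
    """
    low_since = None
    for t_ns, volts in samples:
        if volts < trip_v:
            if low_since is None:
                low_since = t_ns
            if t_ns - low_since >= min_low_ns:
                return True
        else:
            low_since = None
    return False


driven = [(0, 2.85), (1, 1.0), (2, 0.5), (4, 0.4), (5, 2.0)]
glitch = [(0, 2.85), (1, 1.0), (2, 2.0)]
print(line_after_drive_phase([False, True, False]), zero_caught(driven), zero_caught(glitch))
```

The minimum-duration requirement is what lets the receiver ignore a brief noise dip on an undriven line while still latching a genuine pulldown.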

Processor Startup 

The Model 850S/Series 950 processor uses a diagnostic
support processor (DSP) to launch itself at power-up or
after a reset. The DSP is implemented using an off-the-shelf
microprocessor, and interfaces to the CPU and CCU0 serial
ports as shown in Fig. 4. The DSP receives its own clock
(the processor clock divided by four). A key challenge for
the DSP design was how to synchronize data transfers be-
tween the DSP and the CPU or CCU0. Synchronization is
achieved by using a double-level clocked register design.
The synchronization clock is received from the CPU (DSP
SYNC in Fig. 4).

During a processor reset, the DSP sets up the CPU so
that the CPU will halt immediately after its internal reset
is complete. After successfully completing its internal self-
test, the DSP loads part of the processor dependent code
(which consists of processor self-test and initialization)
into the cache RAMs via scan paths into CCU0 and starts
the CPU. Successful completion of the processor self-test
transfers control to the rest of the processor dependent
code which is located on the processor dependent
hardware board. This code completes the self-test and ini-
tializes and boots the system.

Locating the DSP on the processor board allows localized
self-test that can identify a failing processor board uniquely.
Failures in the path from the processor board to the processor
dependent hardware board can also be identified. This signif-
icantly increases supportability by decreasing the mean time
to repair (MTTR).

Processor Board Electrical Design 

The Model 850S/Series 950 processor printed circuit board
consists of 12 layers: three ground, three voltage, four signal,
and top and bottom pad layers. The clock distribution area
has extra layers of ground and voltage planes to provide
better noise filtering. The signal layers are always placed
between ground or voltage planes to provide consistent
characteristic impedance. Signal traces are 0.005 inch wide
and have 0.025-in pitch to minimize crosstalk. Manual place-
ment and routing were used to route the signals in just four
layers, reducing crosstalk and minimizing impedance and
propagation delay mismatches on the clock signals. Signal
traces typically exhibit about 50Ω of characteristic imped-
ance on an unloaded board. The characteristic impedance is
lower on loaded boards because of the input capacitances of
the components.

The electrical design of the processor was first based on
worst-case circuit simulation using Spice, which provided
the initial data for timing budgets and noise margin. The
board design was later refined by experiments conducted on
revisions of the actual board.

A key challenge was the design of the VDL supply (2.85V
nominal), which powers the internal VLSI clock circuitry,
all the bus drivers, and the cache bus termination. The
processor, memory controller, and bus converter boards
have individual VDL regulators.

VDL bypassing was especially a concern on the processor
board, where the noise was particularly severe. Several
types and sizes of bypass capacitors are used to bypass
various frequency bands: 0.001-µF, 0.01-µF, and 0.1-µF
ceramics, 22-µF and 68-µF tantalums, and 1000-µF
electrolytics. The voltage plane area on the printed circuit board
was also maximized.

Since the processor, memory controller, and bus converter
boards have separate VDL regulators, any significant
offset between the outputs (greater than about 150 millivolts)
forces the regulator with the highest value to source
the other boards' requirements. Having the clock board
supply a common voltage reference for all the local regulators
keeps the nominally 2.85V VDL supplies within 150
mV from board to board.

One of the initial design concerns had to do with noise 
coupling on the cache bus signal lines. The worst-case 
situation involves an undriven victim trace surrounded by 
driven traces. Additional coupling is introduced by the 
cache bus termination scheme, which uses a resistor network
in a dual in-line package for board density reasons.
The effect of the termination resistor package is reduced
by mixing signals of different phases in the same package
and by using a part that has multiple power and ground
pins and built-in bypass capacitance.

The worst-case design was verified by the construction
of slow and fast boards. The slow board was constructed
using characterized parts (VLSI chips, RAMs, etc.) whose
performance was within specification but close to the
minimum acceptable. Similarly, the fast board was constructed
using parts that performed near the maximum performance
specification. The slow board was used primarily
to resolve timing and performance issues, while the fast
board was primarily used to verify noise margins and EMI
performance. These boards proved to be extremely valuable
in establishing confidence in the design, which had very
aggressive design goals.

The processor board design incorporates several features
to minimize EMI. The outer boundary of the board is provided
with window frames which make electrical contact
to the chassis through wiping contacts. This helps contain
the noise within the board and processor compartment.

Processor Thermal Design 

The thermal design of the Model 850S/Series 950 processor
board and its integration into the SPU cooling system
posed some very challenging design problems. The design
parameters were to cool nine VLSI ICs dissipating up to
12 watts each on a single board with a worst-case environment
of 15,000-foot altitude and 40°C plus design margins.
It was desirable to use the same airflow used to cool the
rest of the electronics in the SPU, components with an
order of magnitude lower power than the VLSI ICs. To
meet this requirement, the thermal resistance of the VLSI
package needed to be an order of magnitude lower than
that of the typical package.

The design was started at the chip level, by designing a 
metal heat spreader into the ceramic package. The VLSI 
chip is bonded directly to this heat spreader, and a heat 
sink is attached to the other side. Theoretical and finite 
element analysis methods were used in the design of the
heat sink, and a wind tunnel was designed to test pro- 
totypes and correlate the theoretical analysis. 

Once the component level design was fully understood, 
the development was expanded to the board level. The
large heat sinks required to cool the VLSI components pre- 
sent enough back pressure to the moving air and their dis- 
tribution on the board is irregular enough that a uniform 
airflow across the board could not be assumed. A wind
tunnel large enough to hold a single board was built and 
the airflow studied to determine the exact amounts of cool- 
ing air that each VLSI IC would receive. A thermal test 
board was also designed and built so that the junction 
temperatures of all the ICs could be directly measured. 

Once the design was confirmed at the board level, the 
thermal test board was put into a complete prototype SPU 
to confirm the airflow at the system level, and system
airflows were measured. This test set was finally put into 
an environmental test chamber and tested at elevated tem- 
peratures to verify the complete thermal design. 

System Main Bus 

Up to four processor slots, two memory subsystem slots, 
and two I/O subsystem slots are available for connection 
to the high-performance system main bus (SMB). The SMB
consists of 64 bits of multiplexed address/data and 17 bits
of control signals. SMB transactions are split into request
and return phases. This allows interleaved requests from
multiple masters to be serviced in parallel. The SMB operates
at 27.5 MHz and is capable of sustaining a 100-megabyte/s
data bandwidth.
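
As a rough check on these figures, the following sketch computes the raw transfer rate of a 64-bit bus clocked at 27.5 MHz and compares it with the quoted sustained rate. The names are illustrative, and the calculation assumes one 64-bit transfer per bus cycle; the text does not break down how the remaining cycles are spent (address, arbitration, and so on).

```python
def raw_bandwidth_mbytes_per_s(width_bits: int, clock_mhz: float) -> float:
    """Raw rate in megabytes/s if every bus cycle carried data (assumption)."""
    return width_bits / 8 * clock_mhz


SMB_WIDTH_BITS = 64        # multiplexed address/data width, from the text
SMB_CLOCK_MHZ = 27.5       # bus clock, from the text
SUSTAINED_MBYTES = 100.0   # quoted sustained data bandwidth

raw = raw_bandwidth_mbytes_per_s(SMB_WIDTH_BITS, SMB_CLOCK_MHZ)
data_cycle_fraction = SUSTAINED_MBYTES / raw

print(f"raw rate: {raw:.0f} megabytes/s")
print(f"fraction of cycles carrying data at the sustained rate: {data_cycle_fraction:.2f}")
```

Under that assumption, the sustained figure corresponds to a little under half of all bus cycles carrying data.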

Both the processor and the bus converters (which connect
the SMB to the two MidBuses) can initiate SMB transactions
as masters. The memory controller is always a slave.
Upon winning arbitration, a master module asserts a 32-bit
address, seven bits of command/size information, and a
6-bit tag that identifies it as the master. SMB read transactions
are initiated by a master module to transfer 4, 16, or
32 bytes from the slave to the master. Table 1 shows all of
the SMB transactions. The processor issues a 4-byte read
during a load to I/O space and a 32-byte read during a
load/store cache miss. The bus converter issues a read when
the CIO bus channel requires data from memory. Clear transactions
are initiated by the processor or the bus converter
to gain access to a semaphore. During a semaphore operation,
the memory controller clears the first word of the half
line and returns the old half line to the master. Return and
return clear transactions are driven by a slave device
returning data following a read or clear operation, respectively.
Read or clear transactions that are "smart" are initiated
by the processor during a cache miss and require a
cache coherency check by any other processors during the
return transaction from the memory controller. Read or
clear transactions that are "dumb" are initiated by the processor
or the bus converter during I/O or DMA operations
and do not require a cache coherency check.

During SMB write transactions, the master sends 4, 16, 
or 32 bytes of data to the slave. Purge cache, flush cache, 
and purge TLB transactions are broadcast on the SMB to 
implement these instructions. Five virtual indexing bits 
are inserted into the 32-bit SMB real address to allow indexing
into a 64K-byte cache during cache coherency checking.

To ensure fair access to the SMB for each module, the
SMB has a circular arbitration scheme based upon a priority
chain with a lockout. The arbitration hierarchy is derived
from the SMB backplane wiring that connects the arbitration
request output signals (NARQ) to the arbitration inhibit
input signals (NINH) of the lower-priority modules.
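
The text does not spell out the lockout rule, so the following is only a toy model of one common way a fixed priority chain can be made fair: a module that has just won is locked out until every other module that is still requesting has been served. The class name, the method name, and the lockout policy itself are assumptions for illustration, not the SMB's documented behavior.

```python
from typing import Optional, Set


class ChainArbiter:
    """Toy priority chain with a fairness lockout (hypothetical policy)."""

    def __init__(self) -> None:
        # Modules that have already won in the current round.
        self.locked_out: Set[int] = set()

    def grant(self, requests: Set[int]) -> Optional[int]:
        """Return the winning module index; index 0 is highest in the chain."""
        if not requests:
            return None
        eligible = requests - self.locked_out
        if not eligible:
            # Every requester has had a turn: release the lockout, new round.
            self.locked_out.clear()
            eligible = set(requests)
        winner = min(eligible)  # the chain inhibits all lower-priority requesters
        self.locked_out.add(winner)
        return winner


arb = ChainArbiter()
order = [arb.grant({0, 1, 2, 3}) for _ in range(8)]
print(order)
```

With all four modules requesting continuously, grants rotate through the chain instead of going to module 0 every time, which is the "circular" behavior the paragraph describes.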

If a module is performing a write, a return of read data,
or a purge TLB transaction, data is transferred on subsequent
bus cycles. The NBUSY signal locks out arbitration
to tie up the bus for multicycle transactions, and the NOK
signal is asserted to indicate that the bus address and data
are valid. NACK (acknowledge) and NRETRY (retry) are used
to indicate acceptance or refusal of transactions. Either
NACK or NRETRY is asserted by a cache coherency checking
processor during a return transaction if a clean or dirty copy
of the line is detected, respectively. The SMB modules log
and report parity, addressing, or protocol errors. The SMB
address and data are protected by a single odd parity bit.
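
A minimal sketch of odd parity follows. The parity bit is chosen so that the total number of ones, including the parity bit, is odd. The function names are illustrative, and the sketch assumes the parity bit covers a single 64-bit bus word; the text does not say exactly which lines are covered.

```python
def odd_parity_bit(value: int) -> int:
    """Parity bit that makes the total count of ones (value + parity) odd."""
    ones = bin(value).count("1")
    return 0 if ones % 2 == 1 else 1


def parity_ok(value: int, parity: int) -> bool:
    """True if the received word and its parity bit have odd total parity."""
    return (bin(value).count("1") + parity) % 2 == 1


word = 0x0123456789ABCDEF
p = odd_parity_bit(word)

print(parity_ok(word, p))               # intact word passes
print(parity_ok(word ^ (1 << 17), p))   # any single flipped bit is detected
print(parity_ok(word ^ 0b11, p))        # two flipped bits go unnoticed
```

A single parity bit detects any odd number of bit errors but cannot locate them and misses even-numbered errors, which is why the memory itself uses a stronger code.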

SMB Electrical Design 

SMB electrical design was extensively modeled to predict
the interactions of the chips, pin-grid array packages,
connectors, and boards. Signal coupling, oscillation, di/dt
noise, power supply noise, and maximum frequency of
operation were simulated. Worst-case SMB simulations
and measurements verified 27.5-MHz operation and an
adequate precharge level.

The interconnect model is a complete, unscaled representation
of the longest trace path. Propagation delay and
impedance values are based on worst-case length and thickness
variations. The chip models and power supply net
impedances are scaled as a compromise between detail and
Spice simulation time. The model is limited to seven signals,
two ground pads, and a single power pad. Detailed
pin-grid array signal models and inductive coupling are
included only in the driving chip.

A new connector and backplane design was required for 
the 27.5-MHz SMB. The design goals were to maintain a 
good impedance match between printed circuit boards 
through the SMB connector, minimize signal crosstalk, and 
minimize the SMB length for high-frequency operation. 
The new connector was developed using proven pin and 
socket technology and our current qualified vendor to reduce
program risk. Press-fit technology allows us to load
connectors (and therefore boards) from both sides of the
backplane to implement a ten-inch-long, 27.5-MHz SMB.

The design goal for the connector impedance was 50Ω
to match the 30Ω VLSI package impedance to the 50Ω
circuit board and backplane impedance. Impedance matching
is important to minimize signal reflection and oscillation.
However, typical pin and socket connectors exhibit
80-to-100-ohm impedances. To reduce connector impedance
and signal crosstalk, the number of ground pins was
increased, resulting in a 1:1 ground-to-signal pin ratio.
Also, a low-impedance ground return path was added: a
fifth row was added to the 4-row, 440-pin connector in the
form of small plates and flat spring contacts. This creates
a very short, low-inductance path with a relatively wide,
flat surface area. The new connector's impedance of 60 to
65 ohms allows the SMB to run at 27.5 MHz and prevents
excessive crosstalk during multiple switching operations.
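
To illustrate why the lower connector impedance matters, the sketch below applies the standard transmission-line reflection coefficient, (Z2 - Z1)/(Z2 + Z1), to a signal leaving a 50-ohm trace and entering the connector. Treating the connector as a simple impedance step is a simplification made here for illustration; the designers' Spice models were far more detailed.

```python
def reflection_coefficient(z_from: float, z_to: float) -> float:
    """Fraction of the incident voltage reflected at a step from z_from to z_to."""
    return (z_to - z_from) / (z_to + z_from)


BOARD_OHMS = 50.0  # circuit board and backplane impedance, from the text

for connector_ohms in (100.0, 80.0, 65.0, 60.0):
    gamma = reflection_coefficient(BOARD_OHMS, connector_ohms)
    print(f"{connector_ohms:5.0f}-ohm connector: {gamma * 100:4.1f}% reflected")
```

Under this simplified model, a typical 80-to-100-ohm connector reflects roughly a quarter to a third of the incident voltage, while the 60-to-65-ohm connector reflects roughly a tenth.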

Memory Subsystem

The main memory system of the Model 850S/Series 950
is designed for high bandwidth, large maximum memory
size, and reliable operation. The main memory controller
can support from 16M to 128M bytes of memory. Provision
has been made to allow the addition of a second memory
controller with its own memory bus and 16M to 128M
bytes of memory; however, this configuration is not supported
at this time.

Memory Controller Board 

The memory controller board provides the interface between
the system main bus (SMB) shared between the processor
and I/O and the memory array bus shared between
the memory controller and the memory array boards. The
memory controller communicates with from one to eight
16M-byte memory array boards over the ASTTL memory
array bus.

The heart of the memory controller board is the memory 
controller chip, a proprietary HP VLSI IC fabricated in 
NMOS-III. This IC incorporates the logic to interface the
SMB to the memory array bus, provide error detection and
correction, conform to HP Precision Architecture require- 
ments, and generate DRAM control timing, all in one 272- 
pin pin-grid array package. The high density of this con- 
troller contributes to the reliable operation of the memory 
system. 

During power failures, data in memory is maintained by 
external batteries for at least fifteen minutes. To maximize 
the amount of battery backup time available, the memory
controller IC is not battery-backed and TTL on the memory
controller board handles all memory refresh operations. 

Memory Array Boards 

Each memory array board provides the basic control logic 
and buffering to drive 144 dynamic random-access memory 
(DRAM) ICs. The DRAMs are arranged as two banks, each
1M words by 64 bits (plus 8 bits for error correction). Memory
data access interleaves between these banks and uses
the nibble-mode function of the DRAMs. The combination 
of bank interleaving, nibble-mode DRAMs, and a wide data 
bus provides the high bandwidth of the memory system. 
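
The array organization can be cross-checked with a little arithmetic. The sketch below uses only figures given in the text (144 DRAMs, two banks, 1M words of 64 data bits plus 8 check bits, up to eight boards); the conclusion that each DRAM contributes one bit per word is an inference from those figures, and the variable names are illustrative.

```python
DRAMS_PER_BOARD = 144
BANKS = 2
WORDS_PER_BANK = 1 << 20     # 1M words
DATA_BITS = 64
CHECK_BITS = 8
MAX_BOARDS = 8

drams_per_bank = DRAMS_PER_BOARD // BANKS
bits_per_dram_per_word = (DATA_BITS + CHECK_BITS) // drams_per_bank
board_mbytes = BANKS * WORDS_PER_BANK * (DATA_BITS // 8) // (1 << 20)
max_mbytes = board_mbytes * MAX_BOARDS

print(f"{drams_per_bank} DRAMs per bank, {bits_per_dram_per_word} bit per DRAM per word")
print(f"{board_mbytes}M bytes per board, {max_mbytes}M bytes with {MAX_BOARDS} boards")
```

The result agrees with the 16M-byte boards and the 128M-byte maximum per memory controller quoted elsewhere in the article.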

Memory can be accessed in either 16-byte or 32-byte
transactions. A 32-byte transaction can come from either
a processor cache miss or the I/O system and requires 17
cycles for read operations and 16 cycles for write operations
from the memory controller to the selected array. A 16-byte
transaction comes from the I/O system and requires 12
cycles for either read or write operations. This timing allows
a maximum sustained performance of 51 megabytes
per second during 32-byte read cycles and 55 megabytes
per second during 32-byte write cycles.
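
The quoted rates follow from the cycle counts if the cycles are counted at the 27.5-MHz system clock. That clock rate for the memory timing is an assumption made here, although the results match the figures in the text. The function name is illustrative.

```python
def sustained_mbytes_per_s(bytes_per_transaction: int, cycles: int,
                           clock_mhz: float = 27.5) -> float:
    """Back-to-back transaction rate in megabytes/s (assumes a 27.5-MHz cycle)."""
    return bytes_per_transaction / cycles * clock_mhz


read32 = sustained_mbytes_per_s(32, 17)
write32 = sustained_mbytes_per_s(32, 16)
xfer16 = sustained_mbytes_per_s(16, 12)

print(f"32-byte reads:  {read32:.1f} megabytes/s")
print(f"32-byte writes: {write32:.1f} megabytes/s")
print(f"16-byte reads or writes: {xfer16:.1f} megabytes/s")
```

The 32-byte read figure works out to just under 52 megabytes per second, which the text states as 51; the 16-byte rate is not quoted in the text and is shown only for comparison.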

To maximize performance, careful attention was paid to
board and backplane layout, the choice of TTL devices
used, and the internal design of the memory controller IC.
Spice simulations were done on the backplane and DRAM
drivers to minimize both delay and skew. Careful analysis
of the DRAM and bank switching control signals was done
to optimize performance while still permitting a large memory
size of 128M bytes per memory controller.

Memory Controller Chip 

The memory controller chip employs a data path organization
with a central register stack containing the address/data
path of the controller. The design, implemented with
80,000 transistors, consists of two I/O bus interfaces, three
control PLAs, and a register stack containing nine circuit
blocks. Of the 272 pins in the package, 84 are used for the
SMB interface, 93 for the memory array bus interface, four
to connect to nine serial scan paths, and 91 for power,
ground, and clocks.

A number of circuit blocks are unique to the memory
controller. There are comparators to map incoming addresses
to various address ranges, queues to buffer addresses and
data, and error detection/correction circuitry for the SMB
and memory array bus addresses and data. To sustain
maximum throughput to memory, 64-bit buses are widely
employed in the data path and interfaces.

The memory controller services reads and writes to main
memory as initiated on the SMB. In addition, the memory
controller responds as an I/O device to configure memory
and return status. The controller can buffer two read transactions
and one write transaction. Each transaction can be
either a 16-byte or a 32-byte data transfer. The controller
buffers the data for one read request and one write request.
The controller's write queue is logically an extension of
the processor's cache. Internally, the controller initiates
memory refresh sequences to main memory.

The memory controller operates as two independent
units controlling the SMB and memory array board interfaces,
respectively. Internally, the controller arbitrates for
use of the queues and buffers within the chip. The controller
can simultaneously process an SMB read/write request
or return read data while processing a memory array bus
read, write to memory, or memory refresh sequence.

This partition ideally maximizes memory throughput 
during heavy traffic conditions. When two memory reads 
are buffered, the controller is able to start the second mem- 
ory read even though the first memory read's data is still 
in the internal data buffers. The second transaction is al- 
lowed to proceed if the first read transaction is able to 
return data to the SMB before the second read's data becomes
available. If the read buffer remains occupied, the
second read is aborted and restarted. 

In any computer, data integrity is paramount. Each 64-bit 
memory word has eight check bits associated with it. These 
bits allow the memory system to detect and correct all 
single-bit errors in any word of memory. Should a single-bit 
error occur, the incorrect data and location are stored in a
memory controller register for operating system use or diagnostic
logging. These eight extra check bits also allow for
the detection of all double-bit errors in any 64-bit word. 
Double-bit errors are not correctable. 
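
The behavior described here (correct any single-bit error, detect any double-bit error, using eight check bits per 64-bit word) is what a single-error-correcting, double-error-detecting code provides. The text does not say which code the memory controller uses, so the sketch below uses a textbook extended Hamming (72,64) code purely as an illustration; the bit layout and all names are assumptions.

```python
from typing import Tuple

PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # Hamming check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data positions


def _syndrome(word: int) -> int:
    """XOR of the positions (1..71) of all set bits; zero for a valid word."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data: int) -> int:
    """Spread 64 data bits into a 72-bit codeword (bit i = position i)."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = _syndrome(word)
    for p in PARITY_POS:            # set the check bits that zero the syndrome
        if s & p:
            word |= 1 << p
    if bin(word).count("1") % 2:    # position 0 holds overall even parity
        word |= 1
    return word


def decode(word: int) -> Tuple[int, str]:
    """Return (data, status): status is 'ok', 'corrected', or 'uncorrectable'."""
    s = _syndrome(word)
    odd = bin(word).count("1") % 2
    if odd:                         # odd overall parity: treat as one flipped bit
        if s > 71:
            return 0, "uncorrectable"
        word ^= 1 << s              # s == 0 means the overall parity bit itself
        status = "corrected"
    elif s:                         # even parity, nonzero syndrome: two flips
        return 0, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


d = 0x0123456789ABCDEF
cw = encode(d)
print(decode(cw))
print(decode(cw ^ (1 << 20)))
print(decode(cw ^ (1 << 3) ^ (1 << 40)))
```

Seven Hamming check bits locate a single error among 71 positions, and the eighth (overall parity) bit distinguishes a correctable single error from an uncorrectable double error.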

Parity is generated and checked between the memory 
controller and the selected memory array on memory ad- 
dresses. This will detect otherwise unreported transient 
errors that could destroy data at random memory locations. 



I/O Subsystem 

The Model 850S/Series 950 I/O subsystem is designed 
to the specifications of HP Precision I/O Architecture. * The 
main design feature of this architecture is its transparency 
at the architected level. To software, the I/O system is a 
uniform set of memory mapped registers independent of 
which bus they are physically located on. 

Although the architecture prescribes a simple and uniform
I/O system software interface, the hardware is allowed
tremendous flexibility. In particular, the I/O system can
include any number of dissimilar buses interconnected by
transparent bus converters. The transparent bus converters
make the boundary between adjacent buses invisible to
software, automatically compensating for differences in
bus speed or protocol. The mapping between buses on the
Model 850S/Series 950 is accomplished primarily through
the use of bus converters.

The architecture differentiates between HP Precision
Architecture buses and other buses. An HP Precision Architecture
bus supports the HP Precision Architecture standard
transactions and can be connected to other HP Precision
Architecture buses through transparent bus converters.
Other buses can be connected to an HP Precision Architecture
system through foreign bus adapters, which are
not transparent, but instead have an architected software
interface. The Model 850S/Series 950 takes advantage of
bus converters where an interface to an existing bus (such
as HP's CIO bus) is required.

The I/O system of the Model 850S/Series 950 relies heavily
on the same custom NMOS-III VLSI technology used in
the system's processor. Two custom ICs were developed
for the I/O subsystem: a bus converter chip, which implements
a subset of the bus converter functionality, and a
CIO bus channel adapter chip, which implements the complete
translation of the MidBus protocol to the CIO bus
protocol. The use of NMOS VLSI technology in these circuits
made possible their implementation in a reasonable
size and at a much lower cost than alternative technologies.

Bus Converter 

The function of the SMB-to-MidBus bus converter is to
convert transactions between the SMB and the TTL-signal-level
MidBus. The bus converter consists of a single board
containing one custom NMOS-III VLSI bus converter chip
and several TTL buffer chips.

As the first bus converter to be implemented in an HP
Precision Architecture system, the Model 850S/Series 950
bus converter had a large influence on the development of 
the bus converter definition for the architecture. Much of 
the bus converter architecture was developed in parallel 
with the bus converter design, and the architecture bene- 
fited from the insights and experience of the bus converter 
implementation team. 

Since transactions can originate on either the SMB or
the MidBus, there are two sets of control centers within
the bus converter chip, each of which is associated with
one of the two bus interfaces. Communication between the
two interfaces is facilitated by data/address queues and an
array of transaction state latches. The algorithms of the bus
converter are designed to maximize throughput of the most
frequent operations, especially writes from I/O modules on
the MidBus to memory on the SMB.

Testability is an important aspect of the bus converter 
design, and the test engineer was an integral member of 
the design team. The bus converter follows the conventional
NMOS-III VLSI test methodology using shift chains,
single-step, and a debug port to allow diagnostic access to 
almost every storage node within the chip. The bus con- 
verter is capable of scanning 977 bits of state information 
from seven scan paths routed through the functional blocks. 

Channel Adapter 

The channel adapter is a bus adapter module that interfaces
the MidBus to the CIO bus, which is a standard HP
I/O bus. The channel adapter performs all the necessary
protocol conversions between the MidBus and the 16-bit,
five-megabyte/second peak bandwidth CIO bus. The channel
adapter consists of a single board containing one custom
NMOS-III VLSI channel adapter chip, several TTL buffer
chips, ROMs containing code for I/O drivers and self-test,
and miscellaneous support logic.

The channel adapter allows full compatibility with all
existing HP CIO I/O cards, as well as additional HP CIO
cards presently in development. Although the CIO bus protocol
differs from HP Precision I/O Architecture in many
ways, the foreign bus adapter maps all of the necessary
CIO functions into the standard register interface through
which it communicates with the I/O system. In accordance
with the CIO bus protocol, the channel adapter serves as
a central time-shared DMA controller on the CIO bus. The
channel adapter is the initiator of all CIO bus transactions,
and it is the arbitrator that maximizes the efficient use of
the CIO bus bandwidth. The channel adapter provides data
buffering and address translation as it transfers data between
the I/O modules on the CIO bus and the memory
modules on other buses within the system. The channel
adapter also translates interrupts and error messages into
the protocol used by the HP Precision Architecture I/O
system. By handling all normal DMA transfers and the
majority of error conditions in complete autonomy, the
channel adapter can greatly reduce the processor overhead
required to operate the CIO bus. Except in the rare error
case that requires software intervention, the channel adapter
appears to the system as a set of standard DMA adapter
modules conforming to the HP Precision Architecture
specifications for an I/O module.

System Clock 

The accurate generation and distribution of a 27.5-MHz
clock signal is crucial to the performance of the Model
850S/Series 950. The clock signal originates on the clock
board and is shaped, amplified, and distributed to eight
individual slots on the system main bus (SMB). While the
Model 850S/Series 950 can use up to four SMB slots, the
clock distribution network must support the full eight SMB
slots to provide for future expansion. Each SMB board has
its own local driver circuitry which then distributes the
clock signal to individual VLSI ICs on the board. The reduction
and control of skew in the clock system was a major
challenge to the design team and required tight tolerances
on many aspects of the design.

The system clock originates on the clock board at TTL
levels. A single hexadecimal NOR TTL buffer is used to
drive the discrete transistor clock sync circuits and the
TTL refresh clocks. The use of a single buffer minimizes
skew between the VLSI memory controller and TTL logic
on the memory controller board which provides battery-backed
refresh signals. The discrete clock sync circuitry
provides the signals that become the main system clock.

Conversion between the TTL clock levels and the
analog levels of the clock sync circuitry is done by an npn
differential amplifier (Fig. 6). The output of the differential
amplifier is then fed to an emitter follower, which drives 
three npn/pnp push-pull pairs which drive the clock SYNC 
signals out to receiver circuitry on up to eight different 
boards on the SMB. The emitter follower's output is the 
one absolute time reference point for the clock generation 
and distribution system. 

The clock SYNC signals enter the 50Ω backplane traces
through source-terminated transmission lines. Each receiver
board has an npn emitter follower receiver which
then drives an npn emitter follower predriver. The predriver
controls an npn/pnp push-pull pair which drives
the clock signal through a termination and deskewing network
to each IC.

Fig. 6. Clock distribution system.

The clock deskewing network minimizes circuit variations
and increases performance. Each network is adjusted
at the factory by adding or subtracting delay lines by means
of jumpers. All boards are set to the same standard, so the
customer can expect a clock system that is optimized for
high performance but will not require periodic field tuning
and will accept boards manufactured at any HP site.

Each VLSI IC has its own internal circuitry that develops
two phases of nonoverlapped clock signals for internal use.
The SYNC signal from the discrete clock circuitry is only
used as a timing reference; hence the name "clock SYNC."
In the Model 850S/Series 950 clock system only the rising
edge of the SYNC signal has any significance.

For a given IC process and machine layout many vari- 
ables, such as device and backplane propagation delays, 
are fixed. Thus the difference in time between the clock 
seen at one point in the system and the clock seen at another 
point, or clock skew, can limit a given machine's maximum 
frequency of operation. In the Model 850S/Series 950 the
limiting effects of clock skew showed up on both the cache 
bus and the system memory bus. Reduction of this skew 
was crucial to system performance. 

Skew was reduced by careful design, layout, and testing 
of backplanes and printed circuit boards. Time intervals 
of tens of picoseconds were calculated and measured.
High-frequency second-order effects such as the effects of via
capacitance had to be understood to minimize the differ- 
ences between clock circuits on different boards. The solu- 
tion has resulted in a clock system that can route a clock 
SYNC signal to three to eight SMB boards and a maximum 
of 28 different VLSI ICs with a total skew of less than 800 ps. 
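
To put the skew figure in perspective, this short calculation expresses 800 ps as a fraction of one period of the 27.5-MHz clock. Both numbers come from the text; the variable names are illustrative.

```python
CLOCK_MHZ = 27.5
TOTAL_SKEW_PS = 800.0

period_ns = 1000.0 / CLOCK_MHZ                    # one clock period in ns
skew_fraction = (TOTAL_SKEW_PS / 1000.0) / period_ns

print(f"clock period: {period_ns:.2f} ns")
print(f"skew budget:  {skew_fraction * 100:.1f}% of the period")
```

The total skew is therefore a little over 2% of the roughly 36-ns cycle, time that would otherwise have to be subtracted from every bus timing budget.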

Control, Diagnostic, and Power Subsystem 

One of the CIO bus channel adapter slots has been customized
to allow the installation of the access port card.
This card provides operator access to the SPU, remote sup- 
port of the SPU via modem connection, and remote operator 
access to the SPU. Operator control of the SPU is provided 
through an operator's console. This console can be located 
away from the SPU for efficient computer room layout. A 
control panel located on top of the SPU contains a subset 
of the operator controls and provides a keyed switch for 
operator access security. 

The control panel provides a visual indication of the
SPU's operational status and an uncomplicated, physical
interface to start and control the system. It provides system
self-test information during the boot process and system
functionality during normal operation through a four-digit
hexadecimal display. Power system diagnostic information
is supplied by LEDs. All diagnostic information is combined
into three basic machine conditions (normal operation,
operation with an operator warning, and nonoperating)
which are displayed through three indicators (green,
red, and yellow) visible from all sides of the SPU. An additional
high-visibility display indicates whether remote
operator access is enabled.

Design of the control panel for electromagnetic compatibility
was especially difficult since the control panel must
contain any electromagnetic interference (EMI) generated
in the high-performance logic system, and must protect the
circuitry from electrostatic discharge (ESD). Mechanical
and electrical design teams worked together to meet these
stringent requirements. Interface circuits are filtered to remove
normal-mode noise before cabling to the control
panel. The filters also guard the internal circuits against
ESD-induced transients. The mechanical design minimizes
the control panel's ESD entry points by using light pipes
and a molded plastic enclosure. The inner surface of the
enclosure is zinc-coated and electrically connected to the
chassis to act as an EMI shield and a conductive path for
ESD.

The Model 850S/Series 950 power system delivers 2.4
kilowatts of dc power to the processor, memory, I/O, and
support subsystems. Ac power is filtered and converted to
300 Vdc in the ac unit. This 300V dc power is then used
to power eight dc-to-dc switching power supplies, which
provide the dc voltages required by the electronics. High-current
power is delivered through a solid copper bus bar,
and lower-current voltages are supplied via cables. Some
critical voltages are regulated on the boards where they are
needed.

SPU Product Design 

Packaging 

The Model 850S/Series 950 is packaged for the interna- 
tional EDP room environment, where operators work on 
the SPU from remote terminals and have only intermittent 
contact with the SPU itself. When there is contact with the 
SPU, the operator is usually standing. For this reason, the 
SPU height and control panel location were designed to 
be a good fit for the middle 90% of the international popu- 
lation. Additionally, the primary status indicators and the 
SPU itself are designed to be viewed from all sides. 

The enclosures (front and back doors, side panels, and
top covers) of the SPU are made of injection molded structural
foam plastic. This provides an excellent appearance,
with crisp lines and consistent detailing. The use of molded
plastic also allowed the design of features for quick, easy
access to the machine, aerodynamic venting for more efficient
cooling, elimination of sightlines into the machine,
and sound absorption.

The processor and I/O cardcages incorporate RFI shielding,
fixtureless backplane assembly, air filter, fan plenum
attachment features, and space for the maximum number
of boards to be installed. Similarly, the molded fan plena,
system frame, and power supply rack all integrate many
functions. This level of integration allows very efficient
manufacturing and simple field installation and access.

The package is key to providing the growth and upgrade 
potential of the product. Using two back-to-back cardcages 
allows the devices on the SMB to be placed close enough 
together that space is left for additional SMB devices to be 
added in the future. All the major subsystems in the SPU 
are near other similar subsystems. This allows the use of 
common cooling and EMI shielding systems, and
minimizes the number of parts required in the package.

Cooling is simplified by the dual-cardcage design. All of
the boards in the system, including the processors, are
arranged vertically and side by side, like books on a bookshelf.
These are in the middle of the SPU and cooling fans
are underneath. The fans are mounted in plastic fan trays
and pull cool, clean air across the components in the system.
The plastic trays and simple card arrangement combine
to provide easy manufacturing and servicing of the
cooling system.

This complex integrated package design was possible 
because of extensive interaction between the packaging 
engineers and the electrical engineers early in the develop- 
ment of the product. The result is an integrated package 
that minimizes floor space and complexity, without sac- 
rificing manufacturability or serviceability. 

System Growth 

A design goal was that the basic Model 850S/Series 950 
SPU hardware design be able to support at least two sub- 
stantial performance upgrades, including larger memory 
and I/O configurations, for possible future HP product re- 
leases. The electrical, firmware, and product design ele- 
ments have been designed to support this goal without 
adding a significant factory cost burden to the initially 
released product. 

The performance levels of the buses (Fig. 2) and their 
support capabilities are key to providing growth potential. 
The high SMB bandwidth can support four additional mod- 
ules of equal or higher performance, ensuring possible pro- 
cessor performance and memory expansion growth paths. 
The two MidBuses can support high-performance I/O de- 
vices or additional VLSI channel adapters, allowing the 
possibility of external I/O expansion. The bus structure 
was thoroughly analyzed, modeled, and margin tested to 
ensure the long-term stability required for growth. 

Careful attention to configuration and bus length require- 
ments produced an internal configuration that is logical 
and supports upgrade goals. The SMB has to be short for 
performance reasons, so SMB devices are on both sides of 
the CPU/memory backplane, cutting the maximum bus 
length in half. The bus converters distribute the MidBuses 
on a separate I/O backplane parallel to the CPU/memory 
backplane. A power distribution bus bar is sandwiched 
between the two backplanes. This efficient basic layout 
provides a very robust configuration with minimum space 
and cost penalties. Slots for three possible future additional 
SMB devices are found on the CPU side of the cardcage. 
Slots for a second memory controller for a second memory 
bus supporting eight more memory arrays are there for 
possible future releases. This design meets the high perfor- 
mance and upgrade goals important to our customers. 

Reliability 

The reliability of the Model 850S/Series 950 SPU is ex- 
pected to be greatly improved over previous SPUs. The 
VLSI technology contributes to this improvement by a large 
reduction in the number of parts in the CPU area alone. 
For comparison, the HP 3000 Series 70 takes eight Model 
850S/Series 950-size boards for the processor, whereas the 
Model 850S/Series 950 uses only one. The VLSI memory 
controller and bus converter greatly reduce parts count. 
Parts counts are also reduced in other areas. 

The design also includes features to improve reliability 
further as the product matures. An excellent example is 
that the memory system was designed to take advantage 
of 1M-byte DRAMs for performance and reliability reasons 
from the beginning. By the time the cost of those chips 
made a 16M-byte board economical, the board design was 
already done. There are many other areas in the current 
design that will enable further reliability improvements. 

Supportability 

SPU support is composed of two categories: initial and 
add-on installation, and on-going support. Both of these 
categories are addressed by the design. 

Supportability goals were set early in the development 
process and reviews were conducted during all phases of 
the development cycle to evaluate the ability of an HP 
Customer Engineer to isolate and replace failed assemblies 
quickly and accurately. For example, there are no config- 
uration switches or jumpers on any assemblies. Instead, 
all assemblies are configured automatically during power- 
up and those required for initial system boot are tested 
automatically during the boot process. Captive fasteners 
are used extensively to speed up unit removal and replace- 
ment. Where captive fasteners could not be employed, the 
number of screw sizes was minimized to simplify the reas- 
sembly process. Front connectors have been eliminated 
from all assemblies, and cabling is reduced to a minimum. 

Fault diagnosis uses a four-level approach, with each 
level capable of diagnosing all hardware necessary to load 
the next level of diagnostics. At power-on, the processor 
board initiates the first level by running a self-test on itself. 
After this test is passed, a second level of self-test is in- 
voked, testing all the hardware in the boot path. The third 
level of diagnostics can now be run, and all the boards in 
the system can be tested. After this, the operating system 
can be loaded. Then the fourth level of diagnostics, a set 
of on-line diagnostics, can be run. 
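
The staged sequence described above can be illustrated with a small sketch. This is not HP's diagnostic code; the level names paraphrase the text, and the functions run_diagnostics and make_levels, the boot_path_ok flag, and the pass/fail checks are hypothetical placeholders chosen only to show that each level must pass before the next one is attempted.

```python
from typing import Callable, List, Tuple

# A level is a (name, check) pair. The checks below are stand-ins (assumptions),
# not real hardware tests.
Level = Tuple[str, Callable[[], bool]]


def run_diagnostics(levels: List[Level]) -> List[str]:
    """Run the levels in order and stop at the first one that fails.

    Returns the names of the levels that passed, so the caller can see
    how far the sequence got.
    """
    passed: List[str] = []
    for name, check in levels:
        if not check():
            break  # a failed level means the next level cannot be loaded
        passed.append(name)
    return passed


def make_levels(boot_path_ok: bool = True) -> List[Level]:
    """Build the four levels named in the text, with placeholder checks."""
    return [
        ("processor self-test", lambda: True),
        ("boot-path self-test", lambda: boot_path_ok),
        ("all-board diagnostics", lambda: True),
        # In the text, the operating system is loaded before this level runs.
        ("on-line diagnostics", lambda: True),
    ]
```

With all checks passing, run_diagnostics(make_levels()) returns all four level names in order. With make_levels(boot_path_ok=False), only the processor self-test is reported as passed, which mirrors the idea that a fault is isolated to the first level that cannot complete.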

Manufacturability 

Throughout the development of the Model 850S/Series 
950 SPU, design engineers and manufacturing engineers 
worked together to optimize the manufacturability of the 
SPU. Early in the project, the SPU design team set goals 
for part count, part number count, number of vendors, as- 
sembly time, and percent autoinsertability of components. 
These goals were chosen to minimize overhead cost and 
cycle time and increase quality. In addition, standard ship- 
ping packages and shipping methods were targeted and 
used to help set the external size constraints for the 
machine. 

Design for manufacturability guidelines were used by all 
engineers, and design reviews were conducted to improve 
the design. As the development of the SPU progressed, 
these critical parameters were tracked and the designs re- 
viewed with the intent of improving the performance in 
these areas. Progress toward the goals was reviewed and 
tracked on a monthly basis, and a complete assembly 
analysis was performed before each major design phase of 
the project. The result is a product with high manufacturing 
efficiency, which translates into lower prices and better 
value to the customer. 

Shipment methods and their effect on installation time 
were important considerations. Two methods of shipment 
were developed. For shipments within the continental 
United States, the Model 850S/Series 950 is shipped as a 
single unit in a padded van. For overseas shipments, the 
Model 850S/Series 950 SPU is put on a single pallet, which 
fits into standard aircraft cargo containers. This two- 
method shipping strategy reduces shipping costs for 
domestic shipments by eliminating parts. 

Installation is now very simple. No mechanical or elec- 
trical assembly is required at the customer's site. The instal- 
lation time is reduced to less than half that of the HP 3000 
Series 70. 

This commitment to designing for manufacturability re- 
sulted in a significant improvement over existing products. 
All measures show improvement over current high-end 
products. Part count, part number count, and assembly 
time have been reduced by 50%. Manufacturing cost has 
been reduced by 25%, and autoinsertability has been in- 
creased by 33%. 